



(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(19) World Intellectual Property Organization 

International Bureau 

(43) International Publication Date 
1 August 2002 (01.08.2002) 




PCT 



(10) International Publication Number 

WO 02/059348 A2 



(51) International Patent Classification 7 : C12Q 

(21) International Application Number: PCT/US02/02564 

(22) International Filing Date: 26 January 2002 (26.01.2002) 



(25) Filing Language: English

(26) Publication Language: English



(30) Priority Data:
60/264,403    26 January 2001 (26.01.2001)    US



(71) Applicant (for all designated States except US): TECHNOLOGY LICENSING CO. LLC [US/US];
Suite 200, 3205 Harvest Moon, Palm Harbor, FL 34683-2127 (US).

(72) Inventors; and

(75) Inventors/Applicants (for US only): FOX, George, E. [US/US]; 3802 Holder Forest Dr.,
Houston, TX 77088-5001 (US). WILSON, Richard, C., III [US/US]; 5147 Birdwood Rd., Houston,
TX 77096 (US). ZHANG, Zhendong [CN/US]; 4393 Wheeler St. #1, Houston, TX 77004 (US).

(74) Agent: WILLSON, Richard, Coale, Jr.; Technology Licensing Co. LLC, Suite 200,
3205 Harvest Moon, Palm Harbor, FL 34683-2127 (US).



(81) Designated States (national): AE, AG, AL, AM, AT, AU,
AZ, BA, BB, BG, BR, BY, BZ, CA, CH, CN, CO, CR, CU,
CZ, DE, DK, DM, DZ, EC, EE, ES, FI, GB, GD, GE, GH,
GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR, KZ, LC,
LK, LR, LS, LT, LU, LV, MA, MD, MG, MK, MN, MW,
MX, MZ, NO, NZ, OM, PH, PL, PT, RO, RU, SD, SE, SG,
SI, SK, SL, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ,
VN, YU, ZA, ZM, ZW.

(84) Designated States (regional): ARIPO patent (GH, GM,
KE, LS, MW, MZ, SD, SL, SZ, TZ, UG, ZM, ZW),
Eurasian patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM),
European patent (AT, BE, CH, CY, DE, DK, ES, FI, FR,
GB, GR, IE, IT, LU, MC, NL, PT, SE, TR), OAPI patent

[Continued on next page]



(54) Title: METHODS FOR DETERMINING THE GENETIC AFFINITY OF MICROORGANISMS AND VIRUSES 






(57) Abstract: Selecting which sub-sequences in a database of nucleic acid such as
16S rRNA are highly characteristic of particular groupings of bacteria, microorganisms,
fungi, etc. on a substantially phylogenetic tree. Also applicable to viruses comprising
viral genomic RNA or DNA. A catalogue of highly characteristic sequences identified
by this method is assembled to establish the genetic identity of an unknown organism.
The characteristic sequences are used to design nucleic acid hybridization probes
that include the characteristic sequence or its complement, or are derived from one or
more characteristic sequences. A plurality of these characteristic sequences is used in
hybridization to determine the phylogenetic tree position of the organism(s) in a sample.
If the target organism is represented in the original sequence database and sufficient
characteristic sequences are available, the method can identify it to the species or
subspecies level. Oligonucleotide arrays of many probes are especially preferred. A
hybridization signal can comprise fluorescence, chemiluminescence, or isotopic labeling,
etc.; or sequences in a sample can be detected by direct means, e.g. mass spectrometry.
The method's characteristic sequences can also be used to design specific PCR primers.
The method uniquely identifies the phylogenetic affinity of an unknown organism without
requiring prior knowledge of what is present in the sample. Even if the organism has not
been previously encountered, the method still provides useful information about which
phylogenetic tree bifurcation nodes encompass the organism.




[Front-page figure: schematic of a tree with three leaf nodes. Each node records the fields
ShortName (e.g. Prok1, Prok2), FullName (e.g. Prokaryote1, Prokaryote2), IsValid, IsMatched
and LeafNumber. Note that a parent node has two pointers to its child nodes and each child
node has a pointer back to its parent.]


(BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, ML, MR, NE, SN, TD, TG).

For two-letter codes and other abbreviations, refer to the "Guidance Notes on Codes and
Abbreviations" appearing at the beginning of each regular issue of the PCT Gazette.

Declaration under Rule 4.17:

- of inventorship (Rule 4.17(iv)) for US only

Published:

- without international search report and to be republished
upon receipt of that report






METHODS FOR DETERMINING THE GENETIC AFFINITY OF MICROORGANISMS AND 

VIRUSES 

Cross-References to Related Applications: This Application claims priority of provisional application
60/264,403 filed 01/26/2001.

Statement Regarding Federally Sponsored Research or Development: The research was funded in part
by grants to R.C.W. and G.E.F. from NASA through the National Space Biomedical Research Institute.

Results CD Appendix: Certain results obtained by the invention are set forth on the CD, which is enclosed
as a part of the application under 37 Code of Federal Regulations Section 1.58.

Program Code Appendix: The computer programs and subroutines of the invention are set forth on the
CD, which is enclosed as a part of the application under 37 Code of Federal Regulations Section 1.96.

Copyright: Contained herein is material that is subject to international copyright protection. The copyright
owner has no objection to the facsimile reproduction of the patent disclosure by any person as it appears in
the Patent and Trademark Office patent files or records, but otherwise reserves all rights to the copyright
whatsoever.

I. BACKGROUND OF THE INVENTION
I. Field of the Invention:

The present invention relates to the general field of biochemical assays and separations, and to
apparatus for their practice, generally classified in U.S. Patent Class 435/6.

II. Description of the Prior Art



Unlike multicellular organisms, bacteria and simple eukaryotic microorganisms have very limited
morphological diversity and typically do not leave a significant fossil record. It therefore was initially very
difficult to develop a classification system that reflects actual genetic relationships. Instead, classic
bacterial taxonomic methods, such as morphology and carbon source utilization, were used to classify
bacteria in a deterministic way. The goal was to develop a hierarchy of tests that ultimately could
reproducibly assign a consistent name to an unknown isolate. When organisms gave very similar results on
the various tests they would ultimately be assigned to the same species regardless of actual genetic
relationship. Thus, organisms were sometimes grouped together that were fundamentally very different.

This situation changed dramatically in the 1970s due to the pioneering work of Carl Woese and
his colleagues. In order to obtain a genotypic classification, methods based on molecular sequence analysis
of ribosomal RNA (rRNA) were developed. The rRNAs offered the advantage of being found in all
organisms, and the equivalent molecules could be readily isolated and purified from essentially any
organism. The large ribosomal RNAs vary in length depending on the organism and therefore have
different names, e.g. 16S rRNA, 18S rRNA, etc., depending on the organism under consideration. To avoid
this difficulty, the terminology small subunit RNA (SSU RNA) and large subunit RNA (LSU RNA) is used
to specify any of the RNAs belonging to each class. Among the rRNAs, 5S rRNA with approximately 120
nucleotides was thought to be too short to be useful, and the LSU RNA (23S rRNA in bacteria) would
have been far more difficult to work with. Attention therefore focused on the SSU RNA (16S rRNA in
bacteria). 16S rRNA is a major component of the bacterial small ribosomal subunit. It consists of
approximately 1,550 ribonucleotides in Escherichia coli and has an intricate secondary structure featuring
extensive intrachain base pairing. The detailed three-dimensional folding of 16S rRNA in the Thermus
aquaticus 30S ribosomal subunit has recently been determined by X-ray crystallography. As a major
component of the ribosome, 16S rRNA interacts with 23S rRNA to establish the overall geometry of the
ribosome and is directly involved in the initiation of protein biosynthesis by ribosomes.

When Woese first began using 16S rRNA in his evolutionary studies it was not technically
feasible to sequence the entire RNA. Therefore a characterization approach was developed (Uchida et al.,
1974) in which the 16S rRNA was fragmented by the nuclease ribonuclease T1. This enzyme cleaves the
RNA at guanosine (G) residues and thereby reduces the RNA to a collection of fragments of various
lengths, each with a single terminal G. The non-G portion of each fragment was then sequenced. The list of all
such fragments obtained from a single RNA was referred to as a catalog. Catalogs of ribonuclease T1
fragments from 16S rRNAs isolated from a variety of organisms were compared to one another and cluster
analysis was used to construct a tree of relationship between the various bacteria (Fox et al., 1977). By
1980, enough data of this type had accumulated that it was possible to construct the first trees that seriously
attempted to identify the actual historical relationships between the various types of bacteria (Fox et al.,
1980; Woese, 1987).

Later, as sequencing technology improved, it became possible to sequence and compare entire 16S
rRNAs.
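
The cataloging step described above can be illustrated computationally. The following is a minimal sketch only, not the procedure or program of the cited authors: it simulates a complete ribonuclease T1 digest by cutting a sequence after every G and counting the resulting fragments. The function name `t1_catalog` and the toy sequence are assumptions made for illustration.

```python
from collections import Counter


def t1_catalog(rna):
    """Simulate a complete RNase T1 digest: cut after every G, count fragments.

    Hypothetical helper for illustration; real catalogs were obtained
    experimentally, not from a known sequence.
    """
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    fragments = []
    current = []
    for base in rna:
        current.append(base)
        if base == "G":  # the enzyme cleaves on the 3' side of G
            fragments.append("".join(current))
            current = []
    if current:  # 3'-terminal fragment, which may lack a terminal G
        fragments.append("".join(current))
    return Counter(fragments)


# Toy sequence (not a real rRNA): fragments are AUCG, G, UUAG, CCG
catalog = t1_catalog("AUCGGUUAGCCG")
```

Two such catalogs can then be compared as multisets, which is the kind of comparison on which the early cluster analyses relied.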

In an effort to better understand the tree produced by cluster analysis, an alternative means of
examining relationships known as "signature analysis" was developed (Woese et al., 1980). It was
observed that certain of the ribonuclease T1 fragments were only found in a subset of the 16S rRNA
catalogs. Frequently there was more than one such sequence that was uniquely found in the same group of
organisms. Thus, the term "signature" was introduced as follows: "a set of oligonucleotides that is
characteristic of (unique to) a group of organisms defines that group and is a "signature" for the group".
These signatures suggested that there was a relationship between the organisms in the group, and so the tree
was examined to see if the tree-generating algorithm had in fact found the expected relationship.

This process of checking the reasonableness of trees produced from the cataloging data was
employed on several occasions (Woese et al., 1980; Woese et al., 1984; McGill et al., 1986). In its final
rendition (McGill et al., 1986), the notion of a signature quality index that could be calculated for every
individual RNase T1 oligonucleotide was introduced as a means of formalizing the extent to which there
was or was not a signature for each branch in the tree.




Today, comparison of 16S rRNA sequences is widely used to establish the genetic relationship
between bacteria. A typical approach is to amplify and sequence 16S rDNA from various prokaryotic
organisms. The resulting sequences are aligned with other 16S rRNA sequences and an appropriate
method, e.g. maximum likelihood, is used to construct a tree that reflects likely historical relationships.
Several public databases exist containing complete and partial small subunit rRNA sequences. For
example, release 8 of the RDP database (Maidak et al., 2000) includes data for the small subunit RNA from
over 16,000 bacteria, eukaryotes, plastids and mitochondria.

As Woese's work became well known it began to be appreciated that rRNA might be useful in
detecting the presence of a target organism in a test sample. Thus, in 1980 Kohne applied for patents (US
patent 4,851,330 granted 25 July 1989 and 5,288,611 granted 22 February 1994), the essence of which is that a
nucleic acid probe that is complementary to the rRNA of a specific target can be used to detect the presence
of that target. This core approach has been widely used in microbial identification, with probes usually
being devised by sequence comparison rather than by Kohne's preferred embodiment, which was subtractive
hybridization. Several commercial products rely on this approach.

The invention described here provides a novel approach for rapidly determining the genetic
affinity of organisms in a test sample. The invention's methodology is far more general than the
specifically targeted tests of the Kohne approach, and faster and more convenient than detailed sequencing
of the rRNAs or their encoding DNA. The method of this invention is currently most readily utilized with
16S rRNA sequence data but can be adapted to other data sets such as rRNA spacers, RNase P RNA,
genomic DNA or RNA of viruses, etc. One begins by defining microbial groups within a phylogenetic tree
that includes the organism range of interest, e.g. all bacteria. Then a set of characteristic
oligonucleotides, each of which identifies a group in the phylogenetic tree, is determined according to a
newly developed algorithm of the invention. This set of signature oligonucleotides is utilized in a
hybridization experiment, e.g. a DNA microarray, the results of which are then used to quickly identify the
phylogenetic neighborhood of a problematic bacterium or other microorganism. These hybridization
experiments can be miniaturized so that minimally trained personnel can readily conduct them in difficult
environments. The set of signature oligonucleotides can be updated and redesigned as our knowledge of the
true genetic affinity between known organisms improves. In many cases, the hybridization array will be
able to determine the genetic affinity of multiple organisms in a sample in one experiment. If the organism
turns out to be a previously known organism, its identity can be determined to the species level if suitable
signature oligonucleotides are included in the hybridization. Under some circumstances, the signature
sequences can also be used in assays in which detection does not rely on hybridization.

Problem Solved by the Invention: The Kohne patents (below) teach methods to utilize probes to detect
specific predetermined organisms or groups of organisms. Thus, the '611 patent teaches us how to
determine if a particular species of organism is or is not present in a test sample. The '330 patent teaches us
how to detect specific groups of organisms as well as individual organisms. It is somewhat limited,
however, in that the probes under this invention are obtained by selection, i.e. subtractive hybridization.
Others have subsequently demonstrated the ability to detect specific groups using probes based on
sequence comparisons.

It is implicit in all these prior art references that one knows what one is looking for. Thus, a prior art test
can be specifically designed for detecting Legionella. However, this is not always what is needed, e.g. a
quick response might be necessary to respond to an outbreak of a previously unknown transmissible
microbial disease. Perhaps even more to the point in this day and age, a terrorist could bioengineer a
normally harmless organism to carry a gene that results in production of a deadly toxin. The resulting
organism would have properties not normally associated with the bacterium that carries the toxin gene.
Indeed, the organism itself might be from a previously unknown genus. Similarly, there are instances where
work is done in remote locations such as the Antarctic or on the International Space Station where one has
extremely limited diagnostic capability available. Even in standard medical practice microbial
identification is needlessly cumbersome in that many alternative specialized tests are now used to identify
the presence of the various known pathogens. In all of these cases, the ability to genetically characterize
and hence identify what organisms or viruses are present in a test sample with a single universal test system
would be invaluable. The invention provides this badly needed solution in a very general way.

References:

Fox GE, Pechman KR, and Woese CR (1977) Comparative cataloging of 16S ribosomal ribonucleic
acid: molecular approach to prokaryotic systematics. Int. J. Syst. Bacteriol. 27:44-57.

Fox GE, Stackebrandt E, Hespel RB, Gibson J, Maniloff J, Dyer TA, Wolfe RS, Balch WE, Tanner RS,
Magrum LJ, Zablen LB, Blakemore R, Gupta R, Bonen L, Lewis BJ, Stahl DA, Luehrsen KR, Chen KN,
and Woese CR (1980) The phylogeny of prokaryotes. Science 209:457-463.

Kohne, David E.; Method for detecting, identifying and quantitating organisms and viruses; US Patent
5,288,611 granted 22 Feb 1994 which claims: "1. A method for detecting the presence of a species of
organism comprising a ribosomal nucleic acid sequence, in a test sample, comprising the steps of:
contacting ribosomal nucleic acid from said test sample with a nucleic acid probe able to hybridize to only
a portion of said ribosomal nucleic acid sequence of said organism, incubating said probe and said
ribosomal nucleic acid obtained from said test sample under specified

Kohne, David E.; Method for detection, identification and quantitation of non-viral organisms; US Patent
4,851,330 granted 25 July 1989 which claims: "A method for detecting the presence in a test sample of any
non-viral organisms belonging to a group, said group consisting of at least one but less than all non-viral
organisms, which comprises: (a) bringing together any test sample rRNA and a nucleic acid probe, said
probe having been selected to be sufficiently complementary to hybridize to one or more rRNA subunit
subsequences that are specific to said group of non-viral organisms and to be shorter in length than the
rRNA subunit to which said probe hybridizes; (b) incubating the probe and any test sample rRNA under
specified hybridization conditions such that said probe hybridizes to the rRNA of said group of non-viral
organisms and does not detectably hybridize to rRNA from other non-viral organisms; and, (c) assaying for
hybridization of said probe to any test sample rRNA

Maidak BL, Cole JR, Lilburn TG, Parker CT Jr, Saxman PR, Stredwick JM, Garrity GM, Li B, Olsen GJ,
Pramanik S, Schmidt TM, and Tiedje JM (2000) The RDP (Ribosomal Database Project) continues.
Nucleic Acids Res. 28:173-174.

McGill TR, Jurka J, Sobieski JM, Pickett MH, Woese CR, and Fox GE (1986) Characteristic
archaebacterial 16S rRNA oligonucleotides. Syst. Appl. Microbiol. 7:194-197.

Uchida T, Bonen L, Schaup HW, Lewis BJ, Zablen L, and Woese CR (1974) The use of ribonuclease
U2 in RNA sequence determination. J. Molec. Evol. 3:63-77.

Woese CR (1987) Bacterial evolution. Microbiol. Rev. 51:221-271.

Woese CR, Maniloff J, and Zablen LB (1980) Phylogenetic analysis of the mycoplasmas. Proc. Natl. Acad.
Sci. USA 77:494-498.

Woese CR, Stackebrandt E, Weisburg WG, Paster BJ, Madigan MT, Fowler VJ, Hahn CM, Blanz P, Gupta
R, Nealson KH, and Fox GE (1984) The phylogeny of purple bacteria: the alpha subdivision. System. Appl.
Microbiol. 5:315-326.

SUMMARY OF THE INVENTION

Applicants' method is summarized as follows: 

A. Establish or otherwise obtain a nucleic acid sequence database of the equivalent nucleic acid from
a variety of organisms. It is best to quality control the database, selecting sequences that are complete
and lack unknown segments in the region of interest and discarding the rest. Any of a variety of nucleic acid
sequences is potentially useful. At present the substantial amount of sequence information available for
rRNAs, especially the SSU rRNA (i.e. 16S rRNA in bacteria), makes that molecule an excellent choice for
bacteria and eukaryotic microorganisms. In the case of viruses the most promising source of information is
currently the sequence of the genomic DNA or RNA.

B. Obtain or develop a bifurcating-node phylogenetic tree that substantially reflects the genetic
relationships between the organisms or viruses whose sequences are included in the nucleic acid sequence
database that is to be used.

C. Choose a smallest sequence length of interest for the characteristic sequences that will be sought.
This length will differ depending on the length of the nucleic acid molecule or region being examined,
the number of sequences in the dataset, and various constraints imposed by the experimental systems that will be
used.

D. Test all possible sequences of this length, N, against the sequences in the nucleic acid sequence
database that is being used in conjunction with the tree. A signature quality function such as Qs is
calculated for every possible sequence of length N at each node in the tree. It is preferable and
computationally efficient to calculate the Qs value only for test sequences of length N that occur at least
twice in the database. Those test sequences that never occur are not signature sequences. Test sequences
that occur once are perfect signature sequences of the particular organism or virus from which the nucleic
acid was obtained. The signature quality function can be defined in a variety of ways but should be
constructed so as to determine the extent to which a test sequence of length N is found in all the organisms
in the database belonging to the set of sequences represented by a node in the tree and not found elsewhere.
A particular test sequence is determined to be a perfect signature of the organisms represented by a
particular bifurcation node on the phylogenetic tree if all the nucleic acid sequences represented by that
node contain the sequence and the sequence is not found in any nucleic acid sequence not represented by
that node. A value Qs between zero (no signature value) and one (perfect signature) is obtained for each
test sequence at each node.




E. Retain as signature sequences those test sequences having Qs above some criterion. A given node
may encompass many signature sequences. Likewise, a particular test sequence can be a signature
encompassed by more than one node, though frequently with differing values of Qs. This reflects the child,
parent, grandparent, etc. relationship between branches in a phylogenetic tree.

F. Optionally, repeat steps D and E for sequences of the desired length (e.g., 7mers, then
8mers, etc.).



G. The signature sequences permit the design of hybridization probes for use in an assay. A typical
assay can employ a plurality of such signature probes representing at least 50%, and typically more, of the
nodes in the applicable phylogenetic tree. The resulting hybridization will allow the identification of the
organism's genetic affinity without the necessity of prior knowledge of what it would be. It is contemplated
that this invention can allow the development of a single test system that can be used to identify a wide
variety of organisms.

H. Once available, the signature sequences can be used in other ways. For example, it is preferable to
detect the presence of specific signature sequences in a sample using mass spectrometry. It is also
preferable to use signature sequences to design PCR primers for a variety of applications.
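
Steps C through E above can be sketched in a few lines of code. This is a hedged illustration only and is not the program supplied on the CD appendix. The scoring function `quality` is an assumed stand-in for Qs (coverage of the node multiplied by specificity to the node), chosen solely because it equals one exactly when every sequence under the node contains the test sequence and no sequence outside the node does. The names `build_oligo_index`, `quality` and `find_signatures`, the toy sequences, and the reading of "occur at least twice" as "occur in at least two database sequences" are all assumptions.

```python
from collections import defaultdict


def build_oligo_index(seqs, n):
    """Map every length-n subsequence to the set of organisms containing it."""
    index = defaultdict(set)
    for name, seq in seqs.items():
        for i in range(len(seq) - n + 1):
            index[seq[i:i + n]].add(name)
    return index


def quality(hits, members):
    """Illustrative score in [0, 1]; an assumption, not the patent's Qs."""
    inside = len(hits & members)
    if inside == 0:
        return 0.0
    coverage = inside / len(members)   # fraction of the node that has the oligo
    specificity = inside / len(hits)   # fraction of all hits that fall in the node
    return coverage * specificity


def find_signatures(seqs, nodes, n, threshold):
    """Score each shared length-n oligo at each node; keep those at or above threshold."""
    kept = []
    for oligo, hits in build_oligo_index(seqs, n).items():
        if len(hits) < 2:  # oligos found once are organism-level signatures already
            continue
        for node, members in nodes.items():
            q = quality(hits, members)
            if q >= threshold:
                kept.append((oligo, node, q))
    return sorted(kept, key=lambda t: (-t[2], t[0], t[1]))


# Toy database of three hypothetical organisms and two tree nodes
seqs = {"A": "ACGUACGG", "B": "ACGUUCGG", "C": "UUGCAAGC"}
nodes = {"AB": {"A", "B"}, "ABC": {"A", "B", "C"}}
signatures = find_signatures(seqs, nodes, 4, 0.9)
```

In this toy case the only shared 4-mer is ACGU, which scores one at node AB and two thirds at node ABC, illustrating how the same test sequence can receive differing values at a child node and its parent.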

In abstract form the invention may be described as follows:

Selecting which sub-sequences in a database of nucleic acid such as 16S rRNA are highly
characteristic of particular groupings of bacteria, microorganisms, fungi, etc. on a substantially
phylogenetic tree. The invention is also applicable to viruses comprising viral genomic RNA or DNA. A
catalogue of highly characteristic signature sequences identified by this method is assembled to establish
the genetic identity of an unknown organism. The signature sequences are used to design nucleic acid
hybridization probes that include the characteristic sequence or its complement, or are derived from one or
more characteristic sequences. A plurality of these signature sequences is used in hybridization to
determine the phylogenetic tree position of the organism(s) in a sample. If the target organism is
represented in the original sequence database, the signature sequences can identify it to the species or
possibly subspecies level. Oligonucleotide arrays of many probes are especially preferred. A hybridization
signal can comprise fluorescence, chemiluminescence, or isotopic labeling, etc.; or sequences in a sample
can be detected by direct means, e.g. mass spectrometry. The method's characteristic sequences can also be
used to design specific PCR primers. The method uniquely identifies the phylogenetic affinity of an
unknown organism without requiring prior knowledge of what is present in the sample. Even if the
organism has not been previously encountered, the method still provides useful information about which
phylogenetic tree bifurcation nodes encompass the organism.

DETAILED DESCRIPTION OF INVENTION

Brief Description of the Several Views of the Drawings:

Figure 1 shows schematically the bi-directional binary tree structure.
Figure 2 shows schematically the structure of the composite hash of the oligonucleotides.
Figure 3 shows schematically the flow chart of the principal programs.
Figure 4 shows schematically how Subsystem I converts the format of the sequence file.
Figure 5 shows schematically a phylogenetic tree and its corresponding Newick format presentation.
Figure 6 shows schematically how the tree file in Newick format is parsed in a stepwise and bottom-up manner.
Figure 7 shows schematically that the trimming is stepwise and topology-conserving.
Figure 8 shows schematically how the composite hash of the oligonucleotides is built from the 16S rRNA
sequences.
Figure 9 shows schematically how the number of oligonucleotides and their respective lengths are
related.
Figure 10 shows the representative prokaryotic phylogenetic tree in Newick format.
Figure 11 shows a graphic view of the representative prokaryotic phylogenetic tree.
Figure 12 shows a local region of the representative tree following trimming from3B, to ^2 sequences. The
branch numbers in the representative tree are labeled in the picture and can be correlated with the results
given in Table F. The complete representative tree is given in Newick format in Figure 10 and shown in
graphical form on the CD that is part of this application.

Table A illustrates by example certain information that is on the CD that is part of this application. The
table illustrates, for test sequences of length 15, the five best signature quality scores and the nodes they are
associated with in the phylogenetic tree.
Complete lists of this type are on the CD for several different sequence lengths.




Table B illustrates by example certain information that is on the CD that is part of this application. The
table illustrates signature sequences of length 12 that are completely unique to the organism that is
indicated.

Table C shows the subsystems of the programs used and their functions and components.
Table D shows the numbers of possible oligonucleotides of different lengths.

Table E shows the number of signature sequences that were found at various quality levels as a function

Table F shows the preferred parameters for the invention.
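
The bi-directional binary tree of Figure 1, in which a parent node points to its two children and each child points back to its parent, can be sketched as follows. This is an illustrative data structure only, not the code of the Program Code Appendix. The field names follow those legible in the front-page figure (ShortName, FullName, IsValid, IsMatched, LeafNumber); their exact meaning is not stated in this section, and the method names `adopt`, `leaves` and `ancestors`, the internal node names, and the third leaf are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through parent links
class Node:
    short_name: str = ""
    full_name: str = ""
    is_valid: bool = True       # field name from the figure; meaning assumed
    is_matched: bool = False    # field name from the figure; meaning assumed
    leaf_number: int = 0
    parent: Optional["Node"] = field(default=None, repr=False)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def adopt(self, left: "Node", right: "Node") -> "Node":
        """Set both child pointers and the children's back-pointers."""
        self.left, self.right = left, right
        left.parent = self
        right.parent = self
        return self

    def leaves(self) -> List["Node"]:
        """All leaf nodes below this node, left to right."""
        if self.left is None and self.right is None:
            return [self]
        return self.left.leaves() + self.right.leaves()

    def ancestors(self) -> List["Node"]:
        """Walk the back-pointers from this node up to the root."""
        out, node = [], self.parent
        while node is not None:
            out.append(node)
            node = node.parent
        return out


# A tree with three leaf nodes; "Prok3", "n1" and "root" are hypothetical names
prok1 = Node("Prok1", "Prokaryote1", leaf_number=1)
prok2 = Node("Prok2", "Prokaryote2", leaf_number=2)
prok3 = Node("Prok3", "Prokaryote3", leaf_number=3)
inner = Node("n1").adopt(prok1, prok2)
root = Node("root").adopt(inner, prok3)
```

The downward pointers give the set of sequences represented by a node, while the upward pointers give the chain of nodes that encompass a given organism.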
Utility of the Invention: 



The invention can identify the genetic grouping an unknown organism belongs to even if no perfect match
is found for the organism of interest (the "target"). The invention designs a set of probes that allows one to
approximately position any target organism on a tree that displays the genetic relationship between the
various organisms. With the invention, it is not necessary to know what organism or group of organisms
one is looking for, nor is it necessary that it even be previously known to science. Ultimately, even if
nothing matches, the invention nonetheless gives useful information. For example, it might be learned that
the unknown organism belongs to the group of enteric bacteria but is not any of the known species. Using
the invention, it is straightforward to generate a clear file with the five best signature quality values in the
format of Table A. The five best signature quality values are listed with the
specific node in the phylogenetic tree.
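
The positioning idea described above may be illustrated with a minimal sketch. It assumes, purely for illustration, that each probe reports on one named node of the tree and that the assay result is reduced to the set of node names whose probes gave a signal; the function name `narrowest_positive_node` and the group memberships are hypothetical and do not come from the patent's data.

```python
def narrowest_positive_node(node_members, positive_nodes):
    """Return the positive node that encompasses the fewest known organisms.

    node_members maps a node name to the set of database organisms under it;
    positive_nodes is the set of node names whose signature probes hybridized.
    Returns None when no probe gave a signal.
    """
    hits = [name for name in sorted(positive_nodes) if name in node_members]
    if not hits:
        return None
    return min(hits, key=lambda name: len(node_members[name]))


# Hypothetical nested groups, from broad to narrow
node_members = {
    "bacteria": {"E. coli", "Salmonella", "Vibrio", "Bacillus", "Thermus"},
    "proteobacteria": {"E. coli", "Salmonella", "Vibrio"},
    "enterics": {"E. coli", "Salmonella"},
}

# Group-level probes light up, but no species-level probe does
placement = narrowest_positive_node(
    node_members, {"bacteria", "proteobacteria", "enterics"}
)
```

Here the sample is placed among the enterics even though no species-level signature matched, which corresponds to the example given in the text.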

Unanticipated problems involving microorganisms occur in a variety of settings, including space flight,
medicine, indoor air quality, bioweapons of mass destruction, epidemics, etc. It would be of value to have a
diagnostic system that could readily identify what microorganisms are present independent of prior expectations
of what might be found, so as to facilitate a rapid assessment of what is occurring prior to choosing
countermeasures. It is especially essential to determine the genetic affinity of the organism that is causing
the problem as closely as possible, since this will clarify where the organism came from, what treatments
are likely to be effective, etc.

Fortunately, each 16S rRNA sequence contains short sub-sequences that are widely conserved throughout
the dataset, and despite the fact that there are now over 16,000 publicly available sequences, there are still
large numbers of other sub-sequences that are totally unique to, and hence characteristic of, a particular
species or various groups of species that can be identified by methods of the invention. Surprisingly, this
pattern of sequence conservation is so strong that it is possible to design specific oligonucleotide
hybridization probes that can distinguish individual organisms, and groupings of organisms, in a tree of
relationship defined by 16S rRNA. Once an appropriate set of target signature sequences has been
identified for a desired assay, appropriate probes can be designed. Although it is anticipated that probes
based on the signature sequences will be used directly, in some applications the probes can be modified
before use. For example, a "wildcard" base such as inosine might be used to extend or even modify the
specificity of a probe. Moreover, two nearby probes might be combined to make a larger probe. Any of a
variety of formats can be used to implement the assays. Thus, the final analysis system may utilize PCR-
amplified nucleic acids or, because rRNAs are typically present in many thousands of copies per cell, just
the sample RNA alone. A variety of detection systems can be used, comprising fluorescence,
chemiluminescence and isotopic detection. The resulting assay is highly compatible with hybridization
array technology (DNA microarrays), which will allow the simultaneous assay of all the nodes in the
underlying tree in one experiment. Thus, it is possible to replace many tests with just one.

It is inherent in the prior art that only predetermined microorganisms or groups of microorganisms will be
detected. This reflects the fact that prior art assays are based on prior identification of specific probes for
the intended application. It is widely believed that a microbial detection system cannot be designed without
prior knowledge of what is to be detected. The invention described here implements a novel approach to
assay design that overcomes this problem.
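
As a small illustration of how a probe may be derived from a signature sequence, the sketch below writes the DNA oligonucleotide that is complementary to an RNA signature, in the conventional 5' to 3' orientation. It covers only the simplest case mentioned in the text (a probe that is the complement of the signature); wildcard bases such as inosine and the merging of nearby probes are not modeled. The function name `dna_probe_for` and the example sequence are assumptions.

```python
# Watson-Crick partners for an RNA (or DNA) target, expressed as DNA bases
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def dna_probe_for(signature):
    """Return the DNA reverse complement of a signature, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(signature.upper()))


# Hypothetical 6-nucleotide signature; practical probes would be longer
probe = dna_probe_for("ACGUUG")
```

Reading the signature backwards before complementing keeps both strands in the 5' to 3' convention, so the returned string can be used directly as an oligonucleotide specification.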

Scientific Basis of the Invention

Although the invention is not to be limited by any theory or by the way in which the invention was
achieved, the following may be helpful in understanding the invention. An extremely effective approach to
determining genetic relatedness among bacteria is to amplify and sequence their 16S rRNA genes (Fox et
al., 1980; Woese, 1987). The resulting sequences are aligned with other 16S rRNA sequences and an
appropriate method, e.g. maximum likelihood, is used to construct a phylogenetic tree. This process is
reasonably fast, very accurate and facilitated by programs and data available via the Internet at the
Ribosomal Database Project (RDP) web site (http://www.cme.msu.edu/RDP/html/index.html) (Maidak et
al., 2000). Many thousands of 16S rRNA sequences, representing essentially all known genera of bacteria,
are now available in the RDP and other ribosomal RNA databases. Therefore, when a new isolate of
uncertain affiliation is found here on Earth, its genetic identity can be inferred from its placement in the
16S rRNA phylogenetic tree.

It was observed early on in the 16S rRNA literature that there were in fact many characteristic ribonuclease T1 "signature" oligonucleotides (ribonuclease T1 oligonucleotides being the subset of all possible oligonucleotides that consists only of those which end in G and contain no internal G). The existence of such signature oligonucleotides in a set of 16S rRNA sequences actually reflects the fact that certain individual positions have a particular value (i.e. C, G, A or U) in all organisms belonging to a cluster and a different value for organisms which do not belong to the cluster. The phylogenetic breadth of the cluster encompassed is different for each signature position, and the signatures are typically somewhat noisy in that the characteristic nucleotide is absent in some organisms that belong to the cluster of interest and present in some organisms that are outside the cluster. The information that is carried by these very informative sites is nevertheless precisely what underlies the success of standard algorithms that construct phylogenetic trees.

In order to quantify this information, a signature quality index which ranges from 0 (no meaningful signature) to 1 (perfect signature) was developed for use with the ribonuclease T1 oligonucleotides (McGill et al., 1986). Such an index allows the quantitative characterization of the utility of any oligonucleotide in determining if an unknown organism belongs to any particular genetic grouping in a particular tree of genetic relatedness. In order to implement the invention it was necessary to modify the signature quality function for use with complete sequence data. The signature quality index used is of the following type:

$Q_s = ({}^{i}f_g) \times (1 - {}^{o}f_g)$ (1)

where $Q_s$ is a measure of signature quality, ${}^{i}f_g$ is the frequency of the signature sequence within the group under consideration, and ${}^{o}f_g$ is the frequency of the signature sequence outside the group of interest. The frequencies are based on the number of sequences in the dataset that a particular oligonucleotide matches, and the resulting function again varies from 0 (no meaningful signature) to 1 (perfect signature).

To illustrate this function, consider a particular heptamer which is found in 50 distinct sequences. If 40 of these occurrences are in a single taxonomic cluster which contains 50 members, and the remaining 10 occurrences are scattered among the remaining sequences, the resulting value of $Q_s$ is 0.64. Finally, the user of the invention needs to understand that when members of a sequence cluster share an oligonucleotide which is not found in non-members of the cluster (e.g. when $Q_s$ is high), the oligonucleotide in question will almost always be found to occur in the equivalent place in all the 16S rRNAs that have it. This reflects the fact that useful signature sequences are phylogenetically conserved at various levels of genetic relationship. This is not obvious because it initially seems very counterintuitive. It is, however, the reason high quality signature oligonucleotides exist. If this were not the case, the various oligonucleotides would be randomly scattered throughout the various sequences, and high values of $Q_s$ would be uncommon and not predictive of what would be found in sequences that were not yet known.
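
By way of a non-limiting illustration, the worked heptamer example can be reproduced with a few lines of Python. This is a sketch, not the program supplied with the application: the function name and argument names are hypothetical, and the "outside" frequency is assumed to be the share of all matches that fall outside the group, which is the reading that yields the stated value of 0.64.

```python
def signature_quality(n_gm: int, n_ot: int, n_m: int) -> float:
    """Signature quality for one oligonucleotide at one group.

    n_gm: matched sequences inside the group
    n_ot: total sequences in the group
    n_m:  matched sequences in the whole dataset
    (names are illustrative assumptions)
    """
    if n_ot <= 0 or n_m <= 0:
        raise ValueError("group size and total match count must be positive")
    f_in = n_gm / n_ot            # frequency within the group
    f_out = (n_m - n_gm) / n_m    # assumed: share of matches outside the group
    return f_in * (1.0 - f_out)


# Heptamer found in 50 sequences, 40 of them in a 50-member cluster.
q = signature_quality(n_gm=40, n_ot=50, n_m=50)
print(round(q, 2))  # 0.64
```

A perfect signature (every group member matched, no match elsewhere) gives 1.0 under this reading, and an oligonucleotide with no matches inside the group gives 0.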

It is also important to realize that there are many alternative ways in which the signature quality function, $Q_s$, can be defined. One might, for example, take the logarithm of the values or use values of $1 - Q_s$. More to the point, one could square the first factor in Equation 1 to give more weight to any false negatives, or cube the second factor to strongly penalize false positives.
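
The effect of such alternative definitions can be sketched as follows. The exponents `a` and `b` and the function name are hypothetical parameters introduced only for illustration; with `a = b = 1` the function reduces to Equation 1.

```python
def weighted_quality(f_in: float, f_out: float, a: float = 1.0, b: float = 1.0) -> float:
    """Generalized quality: (f_in ** a) * ((1 - f_out) ** b).

    a > 1 penalizes false negatives (group members lacking the oligonucleotide);
    b > 1 penalizes false positives (matches outside the group).
    """
    return (f_in ** a) * ((1.0 - f_out) ** b)


f_in, f_out = 0.8, 0.2                        # values from the heptamer example
base = weighted_quality(f_in, f_out)           # Equation 1 as written
strict_fn = weighted_quality(f_in, f_out, a=2)  # first factor squared
strict_fp = weighted_quality(f_in, f_out, b=3)  # second factor cubed
print(round(base, 4), round(strict_fn, 4), round(strict_fp, 4))
```

Because both factors lie between 0 and 1, raising either exponent can only lower the score, so thresholds such as 0.6 would need to be reconsidered if a different definition were adopted.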

What size of oligonucleotides will give useful signature information? For short oligonucleotides, the equivalence of position is overshadowed by chance occurrences: of the 4,096 ($4^6$) different hexamers, many can be expected to occur by random chance among the roughly 1,500 hexamers that one expects to find in a single 16S rRNA sequence. Thus, the heptamers ($4^7$ = 16,384 in total) represent the smallest sequence length that is likely to produce meaningful signature information. On the opposite side, large oligonucleotides tend to be unique to individual organisms. That is to say, as oligonucleotide size increases, a larger portion of the signatures will be for leaf nodes, i.e. small numbers of closely related organisms, and a decreasing percentage will signify internal nodes. Based on prior experience with 16S rRNA ribonuclease T1 oligonucleotides, it is likely that sequences larger than length 15 will mainly have utility for leaf nodes.
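
The size argument can be made concrete with a short calculation. This sketch assumes a sequence length of 1,500 nt (the approximate full length mentioned in the text); the function names are illustrative only.

```python
SEQ_LEN = 1500  # approximate length of a full 16S rRNA, as used in the text


def possible_oligos(n: int) -> int:
    """Number of distinct oligonucleotides of length n over a 4-letter alphabet."""
    return 4 ** n


def windows_in_sequence(seq_len: int, n: int) -> int:
    """Number of overlapping length-n windows in one sequence."""
    return max(seq_len - n + 1, 0)


for n in (6, 7, 10, 15):
    total = possible_oligos(n)
    windows = windows_in_sequence(SEQ_LEN, n)
    # windows / total is a rough indicator of how crowded the space is
    print(n, total, windows, round(windows / total, 6))
```

For hexamers a single sequence supplies about 1,495 windows against only 4,096 possibilities, so chance matches are common; for heptamers the space is four times larger, and by length 15 it exceeds one billion.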



Design and implementations

Programming language

Except for the first program, readseq, which is preinstalled as a binary executable, all programs developed for this project were written in Perl.

Perl is a freely available, non-proprietary open-source programming language. Thus, programs written in Perl will not be affected by possible future changes in the license of the language compiler/interpreter. Perl is also a very high-level language for general purposes. It has 4 function points per 100 lines of code, compared with 0.8 for C and 2 for C++. This means that software development in Perl is generally much faster than in most other programming languages. Perl is especially efficient in dealing with text, which makes it an appropriate choice for manipulating genetic sequences. In addition, Perl's excellent built-in data structures, automatic garbage collection, and almost unrivalled portability also make it more attractive.



More information on Perl and its newest release can be found at the Perl web site: http://www.perl.com.

Data structures

All Perl built-in data structures, namely scalar, array, and hash, are used in this invention. Because of the complexity of the data presentations, more sophisticated data structures, such as a bi-directional binary tree and a composite hash, are also used.

Given the characteristic structure of the phylogenetic tree, it was natural to represent it as a binary tree in the program. In this case the tree structure is special in that it is bi-directional. The parent tree node has a pointer to each of its two child tree nodes and the child tree node also has a pointer back to its parent tree node (Figure 1). This unusual tree structure is required to facilitate the signature quality index value calculation at each branch tree node (excluding the tree root and all the leaf nodes).

Each leaf tree node has five data fields: "shortName", "fullName", "leafNumber", "isValid", and "isMatched" (Figure 1). The first two fields hold the abbreviated name and the full name of the prokaryote. leafNumber records the sequentially assigned number of the leaf node in the tree. The last two are Boolean variables used mainly for calculation purposes. Each branch tree node has four data fields: "nodeNumber", "numLeaves", "numValidLeaves", and "numMatchedLeaves" (Figure 1). The first field records the sequentially assigned number of the branch tree node. The other fields record the number of leaves, "valid" leaves, and "matched" leaves descended from this branch tree node respectively.

Figure 1 shows the bi-directional binary tree structure with three leaf nodes. Note that a parent node has two pointers to its child nodes and each child node has a pointer back to its parent.
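
A minimal sketch of such a bi-directional tree, written in Python rather than the Perl of the actual programs, is given below. The field names follow the text; the `parent`, `left`, and `right` attribute names and the example organisms are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass(eq=False)
class Leaf:
    shortName: str
    fullName: str
    leafNumber: int
    isValid: bool = False
    isMatched: bool = False
    parent: Optional["Branch"] = field(default=None, repr=False)  # back pointer


@dataclass(eq=False)
class Branch:
    nodeNumber: int
    left: Union["Branch", Leaf]
    right: Union["Branch", Leaf]
    numLeaves: int = 0
    numValidLeaves: int = 0
    numMatchedLeaves: int = 0
    parent: Optional["Branch"] = field(default=None, repr=False)  # back pointer

    def __post_init__(self):
        # Wire the child-to-parent pointers as soon as the branch is created.
        self.left.parent = self
        self.right.parent = self


# Three-leaf tree in the spirit of Figure 1 (hypothetical organisms).
a = Leaf("A", "Organism A", 1)
b = Leaf("B", "Organism B", 2)
c = Leaf("C", "Organism C", 3)
inner = Branch(nodeNumber=2, left=b, right=c, numLeaves=2)
root = Branch(nodeNumber=1, left=a, right=inner, numLeaves=3)
```

The back pointers let a calculation start at a leaf and walk upward to every ancestor, which is the access pattern needed when counting matched leaves per branch node.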

A composite hash was used to store all the oligonucleotides of a specific length derived from a dataset of the prokaryotic 16S rRNA sequences and their related information. The "infrastructure" of this composite hash was implemented with Perl's built-in hash. Because of the complexity of the information on each oligonucleotide, an anonymous hash data structure was heavily used to accomplish the task.

In Perl, a hash is composed of unique keys and their corresponding values. The keys of the outermost layer of the composite hash are the sequences of the oligonucleotides, and the value of each key is an anonymous hash which has three keys - "matchingTimes", "matchedOrg", and "treeNodeValues". The value of "matchingTimes" counts how many times the oligonucleotide occurs in the 16S rRNA sequence dataset. The value of "matchedOrg" is the set of the names of the organisms whose 16S rRNA sequences are matched by this oligonucleotide. Because of the special nature of the hash - that is, its keys must be unique - the set is also implemented with an anonymous hash, whose keys are the names of the matched organisms and whose corresponding values are set to "undef". The value of "treeNodeValues" records the five highest quality index values at the branch nodes. This is implemented with an anonymous hash whose keys are the branch tree node numbers and whose corresponding values are the quality index values (Figure 2).
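
The layout of the composite hash can be mirrored with nested Python dictionaries, as sketched below. The helper name `record_hit`, the oligonucleotide, the organism names, and the node number are all hypothetical; `None` stands in for Perl's "undef".

```python
def record_hit(table: dict, oligo: str, organism: str) -> dict:
    """Register one occurrence of `oligo` in the sequence of `organism`."""
    entry = table.setdefault(
        oligo,
        {"matchingTimes": 0, "matchedOrg": {}, "treeNodeValues": {}},
    )
    entry["matchingTimes"] += 1
    entry["matchedOrg"][organism] = None  # dict keys act as a set of names
    return entry


table = {}
record_hit(table, "GGAUUAG", "orgA")
record_hit(table, "GGAUUAG", "orgA")   # second occurrence in the same organism
record_hit(table, "GGAUUAG", "orgB")

# Quality index values are filled in later, keyed by branch node number.
table["GGAUUAG"]["treeNodeValues"][17] = 0.64
```

Using the keys of an inner dictionary as a set reproduces the uniqueness property the text relies on: an organism is recorded once no matter how often the oligonucleotide recurs in its sequence.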



Figure 2 shows the elaborate structure of the composite hash used in the program. Only two entries are shown in this figure. A hash is represented by a table and the keys are shaded. ∅ denotes the data type "undef" in Perl. The data in this hash are for elucidatory purposes only.



Algorithm

The signature quality index measures how well an oligonucleotide identifies a taxonomic group of prokaryotic organisms in the phylogenetic tree. Thus, the index quantitatively measures the "quality" of the signature sequences and ranges from 0 (no meaningful signature) to 1 (perfect signature). The index can be mathematically expressed as:

$Q_s = ({}^{i}f_g) \times (1 - {}^{o}f_g)$ (1)

where $Q_s$ is a measure of signature quality, ${}^{i}f_g$ is the frequency of the signature sequence within the group under consideration, and ${}^{o}f_g$ is the frequency of the signature sequence outside the group of interest.



Given a defined group of prokaryotes, ${}^{i}f_g$ and ${}^{o}f_g$ can be empirically described as:

${}^{i}f_g = N_{GM} / N_{OT}$ (2)

${}^{o}f_g = (N_M - N_{GM}) / N_M$ (3)

where $N_M$ is the number of probe-matched prokaryotes in the entire tree, $N_{GM}$ is the number of probe-matched prokaryotes in the group of interest, and $N_{OT}$ is the number of prokaryotes in the group under consideration. Substituting equations (2) and (3) into equation (1) gives:

$Q_s = (N_{GM}^2) / (N_{OT} \times N_M)$ (4)

Preferably, the invention uses equation (4) to calculate the signature quality index $Q_s$, and in order to do so during run time it keeps tracking $N_{GM}$, $N_{OT}$, and $N_M$ of every oligonucleotide of a specific length at every internal tree node. Since equation (4) is derived from equations (1), (2), and (3), if any one of these three equations changes, which may happen as new knowledge is gained about how signature sequences occur and are distributed in 16S rRNA sequences, equation (4) will change accordingly. This great flexibility provides system improvements that are included in the invention.

System implementation

The identification system used to find characteristic oligonucleotides in the 16S rRNA sequence dataset consists of the following twelve principal programs and several auxiliary programs, all provided on the CD enclosed with the application.

Principal programs:

readseq (preinstalled program, not written by the author)
fasta2flat
seq_classifier
tree_parser
select_seq
probes_hash_table_generator
calc_node_value
result_printer & result_printer_
group_node_lister
list_hit_branch_nodes
hybridize

Auxiliary programs:
node_selector
tree2newick

Figure 3 gives a panoramic view of the relationship among the principal programs and the dataflow in this system. This oligonucleotide identification system can be roughly divided into four functionally different subsystems, which in turn carry out sequence file format conversion, internal data structure preparation, function value calculation, and result presentation respectively (Table A).

The unaligned prokaryotic 16S rRNA sequences were downloaded from the RDP in GenBank format. The 16S rRNA sequences are from those prokaryotic organisms that appear in the comprehensive prokaryotic phylogenetic tree. GenBank format is the standard format for annotated nucleic acid and protein sequences. In this format, a sequence is recorded with several fields of information including its locus, definition, reference, and origin. Since only the abbreviated names of the organisms and the 16S rRNA sequences in the sequence file are needed for the purpose of this project and all other information is redundant, it is necessary to extract the needed data from the sequence file and discard the rest in order to increase program efficiency.

This data extraction functionality is fulfilled by Subsystem I, the sequence file format conversion subsystem, which is composed of readseq and fasta2flat (Figure 4). Readseq is a preinstalled program. It is a convenient and useful utility to convert the format of a sequence file among GenBank, FASTA, and many other formats. FASTA format is also a common sequence format and is usually used in sequence alignment. In this format, a right angle bracket ('>') is followed by the sequence name on the same line, which is followed by the sequence itself starting on a new line. This project used readseq to change the 16S rRNA sequence file from GenBank format to FASTA format, in which only the abbreviated names of the organisms and the 16S rRNA sequences are retained while all other information is discarded.

Since the 16S rRNA sequence is long and extends over several lines in FASTA format, it is not convenient to use the sequences in this format. To further facilitate the manipulation of the 16S rRNA sequences and the corresponding organism names, the program fasta2flat takes the sequence file in FASTA format as the input and rewrites the sequence data in a "flat" format, in which every line is a data entry starting with the organism name, followed by a tab character ("\t") as the separator, followed by a string of letters (A, U, G, C), which is the 16S rRNA sequence.
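
The conversion performed by fasta2flat can be sketched as follows. This is an illustrative Python reimplementation under stated assumptions, not the Perl program on the CD: the function name is hypothetical, and the organism name is assumed to be the first whitespace-delimited word after the '>' character.

```python
from typing import List


def fasta_to_flat(fasta_text: str) -> List[str]:
    """Convert FASTA text to one 'name<TAB>sequence' line per entry."""
    entries: List[str] = []
    name, chunks = None, []
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:                 # close the previous entry
                entries.append(f"{name}\t{''.join(chunks)}")
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif name is not None:
            chunks.append(line)                  # sequence continues over lines
    if name is not None:                         # close the final entry
        entries.append(f"{name}\t{''.join(chunks)}")
    return entries


example = ">Eco Escherichia coli\nAUGC\nGGAU\n>Bsu\nCCGA\n"
flat = fasta_to_flat(example)
print(flat)
```

Each flat line can then be split on the first tab to recover the name and the full sequence without any further parsing.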


As shown in Figure 4, Subsystem I converts the format of the sequence file.

Subsystem II builds the binary prokaryotic phylogenetic tree and the composite oligonucleotide hash. These internal data structures were used to calculate the function value at each branch tree node.

Release 7 from RDP contains a total of 7,322 prokaryotic 16S rRNA sequences. However, not all of these sequences can be used to generate the set of oligonucleotides (please refer to the section on program probes_hash_table_generator for an explanation of how the set of oligonucleotides was generated), because many of them are only partial sequences of 16S rRNAs (e.g. a sequence has only 300 nt instead of about 1,500 nt, the full length of 16S rRNA) and many contain positions that have not been fully determined (i.e. a position is denoted by a letter other than A, U, G, and C). Program select_seq filtered out these problematic "invalid" sequences and retained 1,921 "valid" sequences that are fully determined and longer than 1,400 nt.
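
The two validity conditions can be expressed compactly. In this sketch the function and constant names are hypothetical, and a sequence of exactly 1,400 nt is assumed to be acceptable, since elsewhere the text rejects sequences "less than 1,400 nucleotides in length".

```python
MIN_LENGTH = 1400                  # length threshold stated in the text
ALLOWED = frozenset("AUGC")        # fully determined positions only


def is_valid(seq: str) -> bool:
    """True if the sequence is long enough and has no undetermined positions."""
    return len(seq) >= MIN_LENGTH and set(seq) <= ALLOWED


def select_valid(flat_lines):
    """Keep 'name<TAB>sequence' lines whose sequence passes is_valid."""
    kept = []
    for line in flat_lines:
        name, _, seq = line.partition("\t")
        if is_valid(seq):
            kept.append(line)
    return kept


good = "AUGC" * 350                # 1,400 nt, fully determined
short = "AUGC" * 349               # 1,396 nt
ambiguous = good + "N"             # contains an undetermined position
print(is_valid(good), is_valid(short), is_valid(ambiguous))
```

Relaxing the criteria, as discussed later in the text, would amount to lowering `MIN_LENGTH` or tolerating a small number of characters outside `ALLOWED`.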

The comprehensive prokaryotic phylogenetic tree based upon 16S rRNA sequences in Newick format was obtained from the RDP web site. The Newick format for representing trees in computer-readable form makes use of the correspondence between trees and nested parentheses, noticed in 1857 by the famous English mathematician Arthur Cayley. A simple exemplary tree and its corresponding Newick format are depicted in Figure 5.

The tree in Newick format ends with a semicolon. Interior (branch) nodes are represented by a pair of matched parentheses. Between them are representations of the nodes that are immediately descended from that node, separated by commas. The tree in Figure 5 has six leaf nodes at the tips (A, B, C, D, E, and F) and five branch nodes inside (the root node and the branch nodes 1 - 4). A branch node can occupy any place where a leaf node could be located, at any level of the tree. The comprehensive prokaryotic phylogenetic tree has 7,322 leaf nodes and 7,321 branch nodes. Since the tree is far from being balanced (as the evolution of life itself is not balanced), some branches of the tree go very deep.

The Newick format of the tree file obtained from the RDP website largely conforms to the Newick Standard described above, with minor differences such as the usage of comma and single quote. See Figure 10 for an example. The tree file contains taxonomic group identifiers and branch lengths. Much information is also recorded for every leaf node, which includes the abbreviated organism name, the full name, etc. When the program tree_parser parses the tree file and builds the internal tree structure, only the abbreviated and full names of the organism are kept for each leaf node and all other information is discarded. The abbreviated name is later compared with every name in the set of matched organisms of every oligonucleotide to determine if this leaf node is matched by a particular oligonucleotide. The full name is used purely for illustrative purposes whenever clear identification of an organism is necessary. Since this system does not use taxonomic group identifiers and evolutionary distances, these data in the tree file were also ignored.



Due to the algorithms and methods used to construct them, all phylogenetic trees are bifurcating, that is, a branch node has exactly two child nodes: a left node and a right node. This feature of a phylogenetic tree makes a binary tree a natural and excellent choice of data structure to represent it in a program. In some cases, the distinction between the relative branching orders is very close and three or more branches are shown as emerging at the same node. Such nearly bifurcating trees are not a problem for the method as they are readily reduced to a bifurcating tree. The tree file in Newick format is parsed in a stepwise and bottom-up manner. Program tree_parser scans the tree file and adds one leaf node at a time to the nascent internal tree, facilitated by a stack of references. Figure 6 shows how a simple internal binary tree is built step by step (the reference stack is not shown).
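
A stack-based, bottom-up parse of this kind can be sketched as follows. The sketch is an assumption-laden illustration, not the tree_parser program: it handles only plain Newick (names, parentheses, commas, a closing semicolon), drops branch lengths after a colon, ignores labels that follow a closing parenthesis, and does not handle the quoting differences of the RDP file mentioned above. The `Node` class and function names are hypothetical.

```python
class Node:
    """Tree node linked in both directions (children and parent)."""

    def __init__(self, name=None):
        self.name = name            # None for branch nodes
        self.children = []
        self.parent = None

    def add(self, child):
        child.parent = self
        self.children.append(child)


def parse_newick(text):
    """Build a tree bottom-up, using a stack of node references."""
    OPEN = object()                 # marker pushed for every '('
    stack, token, after_close = [], "", False
    for ch in text.strip():
        if ch not in "(),;":
            token += ch
            continue
        name = token.split(":")[0].strip()   # drop any branch length
        if name and not after_close:         # text after ')' is a label: skip
            stack.append(Node(name))
        token, after_close = "", False
        if ch == "(":
            stack.append(OPEN)
        elif ch == ")":
            children = []
            while stack and stack[-1] is not OPEN:
                children.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()                       # discard the marker
            branch = Node()
            for child in reversed(children):  # restore left-to-right order
                branch.add(child)
            stack.append(branch)
            after_close = True
    if len(stack) != 1 or stack[0] is OPEN:
        raise ValueError("malformed tree")
    return stack[0]


def leaf_names(node):
    """Leaf names in left-to-right order."""
    if not node.children:
        return [node.name]
    return [n for child in node.children for n in leaf_names(child)]


root = parse_newick("((A,B),(C,(D,E)));")
print(leaf_names(root))
```

Leaves are pushed as they are read, and each closing parenthesis pops the pending children and replaces them with their parent, so every subtree is complete before the node above it is created.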

Figure 6 shows how the tree file in Newick format is parsed in a stepwise and bottom-up manner. (a) A phylogenetic tree in Newick format. (b) The internal tree structure is built stepwise and from the bottom up. The filled circles denote leaf nodes and the hollow circles branch nodes.

Program tree_parser builds the internal comprehensive prokaryotic phylogenetic tree using the tree file in Newick format as the blueprint and serializes it to an external binary file SSU_Prok.tree.bin for possible later use. It then marks the leaf nodes in the internal tree structure "valid" or "invalid" according to the names of prokaryotes in file SSU_Prok.fasta.converted.valid, the output of program seq_classifier, and serializes the marked tree to file SSU_Prok.treeMarkedTotal.bin. This tree structure can be used later to calculate the function values, but the process is inefficient because nearly 74% of the leaf node sequences are not of the very highest quality. The large number of invalid leaf nodes in the tree makes its size unjustifiable. Another difficulty is that some taxonomically different branch nodes may actually represent the same group of valid descendant leaf nodes.

These potential difficulties were avoided by using a representative tree based on only the highest quality sequences. Building such a representative tree requires examination of the existing published tree of 7,322 sequences to determine which groupings and individual sequences, e.g. known pathogens, need to be included. This representative tree meets the following qualifications:

■ It only contains bacteria whose 16S rRNAs have been fully sequenced.

■ At least one of them represents each grouping.

■ The topology of this representative tree should conform to that of the comprehensive tree.

In order to construct a representative tree, 929 prokaryotes were selected whose 16S rRNA sequences are of the highest quality. The list of the leaf node numbers of these 929 prokaryotes was kept in the text file selected_leaf_node_list. This tree is considerably larger than the 98-sequence version provided by RDP with its Release 7 dataset.

In order to keep the topology of the representative tree in accordance with that of the comprehensive tree, after writing out the binary files SSU_Prok.tree.bin and SSU_Prok.treeMarkedTotal.bin, program tree_parser used the list of selected leaf nodes in file selected_leaf_node_list as the reference to "trim away" (Figure 7) invalid and valid-but-unselected leaf nodes in the tree structure, resulting in a representative tree with 929 valid leaf nodes. This trimmed tree structure was serialized to the binary file SSU_Prok.treeMarkedTrimmed.bin, which was later used in the signature quality index value calculations.

Figure 7 illustrates that the trimming is stepwise and topology conserving.
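
One way to trim a tree while conserving topology is sketched below. It is an assumption about how such trimming can be done, not a transcription of tree_parser: unselected leaves are dropped, a branch left with no descendants disappears, and a branch left with a single child is spliced out so that the survivor attaches directly to the next branch up. All names are hypothetical.

```python
class Node:
    def __init__(self, name=None, children=()):
        self.name, self.parent, self.children = name, None, []
        for child in children:
            child.parent = self
            self.children.append(child)


def _trim(node, keep):
    """Return the trimmed subtree, or None if no selected leaf remains."""
    if not node.children:                          # leaf node
        return node if node.name in keep else None
    kept = [t for t in (_trim(c, keep) for c in node.children) if t is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]                             # splice out this branch
    node.children = kept
    for child in kept:
        child.parent = node                        # repair back pointers
    return node


def trim_tree(root, keep):
    new_root = _trim(root, set(keep))
    if new_root is not None:
        new_root.parent = None
    return new_root


def to_newick(node):
    if not node.children:
        return node.name
    return "(" + ",".join(to_newick(c) for c in node.children) + ")"


# ((A,B),(C,(D,E))) trimmed to the selected leaves A, D and E.
tree = Node(children=[
    Node(children=[Node("A"), Node("B")]),
    Node(children=[Node("C"), Node(children=[Node("D"), Node("E")])]),
])
trimmed = trim_tree(tree, {"A", "D", "E"})
print(to_newick(trimmed))  # (A,(D,E))
```

The relative branching order of the surviving leaves is unchanged: D and E still share a branch that excludes A. For a tree as deep as the comprehensive tree, an iterative traversal would be safer than this recursive one.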

Program select_seq takes three files, SSU_Prok.fasta.converted.valid, selected_leaf_node_list, and SSU_Prok.tree.bin, as the input and creates file SSU_Prok.fasta.converted.valid.selected as the output, which will be used to construct the composite oligonucleotide hash in the next step. Input file SSU_Prok.fasta.converted.valid contains all "valid" 16S rRNA sequences in a special "flat" format. File selected_leaf_node_list keeps all leaf node numbers of the selected prokaryotes. SSU_Prok.tree.bin is the binary file from which the comprehensive prokaryotic phylogenetic tree is retrieved. The tree structure is used to index between the leaf node number and the abbreviated organism name at the leaf node. The output file holds the 16S rRNA sequences of the selected organisms in the same format as SSU_Prok.fasta.converted.valid.

Program probes_hash_table_generator is responsible for generating the composite hash, which records the needed information for each of all occurring oligonucleotides of a specific length from the 16S rRNA sequences dataset. The program takes the probe length (x) as the command-line argument and implicitly opens sequence file SSU_Prok.fasta.converted.valid.selected to get the abbreviated names of selected organisms and their corresponding 16S rRNA sequences. The hash of probes of length x is output as binary file hashForProbeLengthx.bin.

Since only the oligonucleotides occurring in the 16S rRNA sequences are considered interesting, naturally all oligonucleotides and their initial cognate information used in this system are derived directly from the 16S rRNA sequences. If we consider the number of all possible oligonucleotides of a given length, the computational saving by deriving oligonucleotides directly from 16S rRNA sequences is substantial. Out of all possible 1,048,576 ($4^{10}$) decamers, only a subset occurs in the 1,921 "valid" 16S rRNA sequences, and 133,599 of them occur more than once. Only these 133,599 multi-occurring decamers (12.7% of all) are used in the next step to calculate the function values, since we are only interested in identifying the phylogenetic neighborhood/group of an unknown bacterium. By definition, oligonucleotides that are unique cannot be characteristic of a group.

Program probes_hash_table_generator reads in the selected 16S rRNA sequences and for each sequence it excises oligonucleotides of the specified length from the 5' end, shifting one nucleotide at a time, to the 3' end (Figure 8). Since an oligonucleotide can occur in 16S rRNAs from several organisms and several times in one particular 16S rRNA, the number of occurrences (matchingTimes) of an oligonucleotide in the hash can only be equal to or greater than the number of the organisms (matchedOrg) whose 16S rRNAs it occurs in.






Figure 8 illustrates how the composite hash of the oligonucleotides is built from the 16S rRNA sequences. 
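
The sliding-window construction of the oligonucleotide table can be sketched as follows. This is a hedged Python illustration of the described procedure, with hypothetical function names and toy sequences far shorter than a real 16S rRNA.

```python
def build_oligo_table(sequences: dict, k: int) -> dict:
    """Collect every length-k window of every sequence, 5' to 3'.

    sequences maps an abbreviated organism name to its sequence.
    """
    table = {}
    for organism, seq in sequences.items():
        for start in range(len(seq) - k + 1):     # shift one nucleotide at a time
            oligo = seq[start:start + k]
            entry = table.setdefault(
                oligo,
                {"matchingTimes": 0, "matchedOrg": {}, "treeNodeValues": {}},
            )
            entry["matchingTimes"] += 1
            entry["matchedOrg"][organism] = None
    return table


def multi_occurring(table: dict) -> dict:
    """Keep only oligonucleotides that occur more than once in the dataset."""
    return {o: e for o, e in table.items() if e["matchingTimes"] > 1}


toy = {"org1": "AUGCAUG", "org2": "GCAUGG"}
table = build_oligo_table(toy, 3)
print(sorted(multi_occurring(table)))
```

In the toy data "AUG" occurs twice in org1 and once in org2, so its occurrence count (3) exceeds its organism count (2), which is the inequality noted in the text.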

At this point the system has completed the necessary preparative work, namely the sequence file format conversions and the data structure constructions. With those steps complete, the system is now ready to calculate the function value at each branch tree node. Subsystem III, the function value calculation subsystem, consists of only one program - calc_node_value. It takes the probe length (x) as the command-line argument and implicitly reads in the corresponding binary probe hash file hashForProbeLengthx.bin and the binary tree file SSU_Prok.treeMarkedTrimmed.bin.



For each multi-occurring oligonucleotide in the hash file, leaf nodes in the phylogenetic tree are marked if this sequence occurs in the 16S rRNAs of the organisms at these leaf nodes. At each branch node the number of its descendant marked leaf nodes is counted by using the unusual backward pointers in the tree structure. The signature quality index values are calculated at all the branch nodes and then sorted in descending order. The top five highest values and their corresponding branch node numbers are kept as the value/key pairs in the treeNodeValues anonymous hash field of this probe in the composite hash. After the calculation is completed, the result is output as a binary file hashForProbeLengthxCalc.bin, which is essentially the same as hashForProbeLengthx.bin except that the treeNodeValues for each multi-occurring oligonucleotide is populated with the calculation results.
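
The calculation for a single oligonucleotide can be sketched as follows. The code is a Python illustration under stated assumptions rather than calc_node_value itself: the quality index is taken to be $N_{GM}^2 / (N_{OT} \times N_M)$, the root is excluded as stated earlier, and the class, function, and organism names are hypothetical.

```python
import heapq


class Node:
    def __init__(self, number, name=None, children=()):
        self.number, self.name, self.parent = number, name, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def all_leaves(root):
    """Iteratively collect the leaf nodes below root."""
    found, pending = [], [root]
    while pending:
        node = pending.pop()
        if node.children:
            pending.extend(node.children)
        else:
            found.append(node)
    return found


def top_node_values(root, matched_orgs, n_best=5):
    """Return {branch node number: quality} for the n_best best branch nodes."""
    leaves = all_leaves(root)
    n_m = sum(1 for leaf in leaves if leaf.name in matched_orgs)
    if n_m == 0:
        return {}
    num_leaves, num_matched = {}, {}
    for leaf in leaves:
        hit = leaf.name in matched_orgs            # "mark" the leaf
        node = leaf.parent
        while node is not None:                    # walk the backward pointers
            num_leaves[node] = num_leaves.get(node, 0) + 1
            if hit:
                num_matched[node] = num_matched.get(node, 0) + 1
            node = node.parent
    values = {}
    for node, n_ot in num_leaves.items():
        if node is root:
            continue                               # the root is not scored
        n_gm = num_matched.get(node, 0)
        values[node.number] = n_gm * n_gm / (n_ot * n_m)
    best = heapq.nlargest(n_best, values.items(), key=lambda kv: kv[1])
    return dict(best)


# Root 0 with branches 1 = (A,B), 2 = (C, branch 3), 3 = (D,E).
b1 = Node(1, children=[Node(10, "A"), Node(11, "B")])
b3 = Node(3, children=[Node(13, "D"), Node(14, "E")])
b2 = Node(2, children=[Node(12, "C"), b3])
root = Node(0, children=[b1, b2])

result = top_node_values(root, {"A", "D", "E"})
print(max(result, key=result.get))
```

In this example the oligonucleotide matches A, D and E; branch 3 scores 4/6, branch 2 scores 4/9 and branch 1 scores 1/6, so branch 3 is the best-identified node.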



Subsystem IV, the result presentation subsystem, reads and retrieves the calculation results from file hashForProbeLengthxCalc.bin. It is the open end of the system: the calculation result can be interpreted and presented in a variety of ways because any program, as long as it can reconstruct the composite hash from the binary file, can "plug into" the system via Subsystem IV and interpret the calculation results in its own way. The programs currently in this subsystem are listed in Table C.

Programs result_reporter and result_reporter_, as their names suggest, are a pair of similar result-presenting programs. They both take the length of probes (x) as the command-line argument, reconstruct the composite hash filled with the calculation results from the corresponding hashForProbeLengthxCalc.bin, and give a list of signature sequences with information on their quality index values, their identified branch nodes, and the descendant leaf nodes as the output files. The only difference between these two programs is that the former outputs the list of signatures sorted by the numbers of the identified branch nodes, while the list output by the latter is sorted in descending order of the signature quality indexes.



Programs group_node_lister and list_hit_branch_nodes present the result from the perspective of the taxonomic groups. group_node_lister lists all identified branch nodes along with their corresponding signature sequences of a particular length specified at the command line. list_hit_branch_nodes takes a more ambitious approach: it gets all the calculation results of oligonucleotides from heptamer to undecamer from files hashForProbeLengthxCalc.bin (x = 7 - 11) and collects the number of times that a branch node is identified by characteristic oligonucleotides of a specific length at signature quality levels 0.6, 0.8, and 1.0 respectively. The analysis result of this program is a set of useful statistics which imply the relationships among the frequency with which a branch node is identified, the oligonucleotide length, and the signature quality.



Program hybridize was used to test the usefulness of the characteristic oligonucleotides that the system has discovered so far. It takes a sequence file as the input in which every entry starts with a label followed by a tab character ("\t") as the separator, followed by the actual 16S rRNA sequence. Although this program can use any reasonably good set of characteristic oligonucleotides as probes, in this preliminary test nonameric signatures were used and they gave satisfactory results. When hybridize reads in a 16S rRNA sequence, it compares the sequence with the characteristic oligonucleotides with a signature quality better than a specified threshold in the selected probe catalogue. When a probe is expected to bind to the sequence, it marks the corresponding branch node in the representative phylogenetic tree. The output of hybridize is one marked representative tree per unknown 16S rRNA sequence at each signature quality level (0.6, 0.8, and 1.0). Some interesting and noteworthy features of the results will be discussed later.



Valid 16S rRNA sequences 

The 7,322 bacterial 16S rRNA sequences obtained from RDP release 7 have multifarious qualities. Some were fully determined in terms of both the length and every position of the sequence, while others are partially sequenced and/or contain one or more undetermined positions. Any sequence that was either less than 1,400 nucleotides in length or has nucleotides other than AUGC (especially N, standing for a position where the nucleotide is undetermined) was deemed invalid by the system and was filtered away. Many of these sequences had very minor difficulties, i.e. they were marginally shorter than required or contained up to 3 uncertain sequence assignments, and could have been used without significant effect. However, since 1,921 16S rRNA sequences met the strongest criteria, it was possible to maintain the very highest standard. Thus only those sequences deemed valid were retained to generate the sets of signature oligonucleotides.



Although the two conditions for disqualifying problematic 16S rRNA sequences greatly simplify how the system deals with low-quality sequences, they are probably far too strict and as a result the current calculations likely did not make maximum use of all the sequence information in the dataset. Sequences a few nucleotides short of 1,400 nt or those that contain a small number of undetermined positions are currently discarded, even though their signature sequences remain mostly intact. To mitigate this problem, the quality demands can be moderately relaxed, i.e. by lowering the length requirement and only discarding the oligonucleotides containing undetermined positions instead of the whole 16S rRNA sequence. However, if a representative phylogenetic tree is used (as in this system), the effect of losing sequence data should be mild since only a subset of 16S rRNA sequences are used anyway. If a branch of the comprehensive phylogenetic tree is absent from the representative tree due to lack of valid 16S rRNA sequences in that cluster, either the quality demands can be decreased as described above or sequences from that very cluster can be added so that this particular branch will be included. Also, it should be appreciated that in some cases, the distinction between the relative branching orders may be very close in some areas of the tree. When this occurs it is not uncommon to show three or more branches emerging from the same node. Such nearly bifurcating trees are not a problem for the method as they are readily reduced to a bifurcating tree.






Oligonucleotides in 16S rRNA sequence dataset 

The number of all possible oligonucleotides of a specific length evidently depends on both the length and how many different nucleotides are legitimate at each position. Given that there are four different nucleotides (A, U, G, C in RNA and A, T, G, C in DNA), if the length of the oligonucleotide is n, the number of all possible length-n oligonucleotides is $4^n$. When length n is large, the oligonucleotides occurring in the 16S rRNA sequence dataset are only a small fraction of all possible oligonucleotides, and there is no simple formula to calculate this number. Table D summarizes these numbers for oligonucleotides from heptamer to undecamer. Figure 9 plots these data and gives a direct visual perception of the trends.

Figure 9 shows that the number of oligonucleotides and the length are related. (a) The number of all
possible oligonucleotides increases exponentially with the length. The curve is described by the function
f(x) = 4^x. (b) The numbers of the total and multiple-occurrence oligonucleotides in the 16S rRNA sequence
dataset also increase with the length. The increases are slower than that in (a) due to the sequence
context constraint from 16S rRNA.
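
The contrast between the 4^n possible oligonucleotides and those actually present in a dataset can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names and the two toy sequences are hypothetical and are not taken from the 16S rRNA dataset, and windows containing an undetermined position are simply skipped.

```python
def count_possible(n, alphabet_size=4):
    """Number of all possible oligonucleotides of length n (4^n for RNA or DNA)."""
    return alphabet_size ** n


def observed_oligos(sequences, n):
    """Set of distinct length-n oligonucleotides occurring in the sequences.

    Windows containing anything other than A, C, G, U are skipped, which
    mirrors discarding oligonucleotides with undetermined positions.
    """
    valid = set("ACGU")
    found = set()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            window = seq[i:i + n]
            if set(window) <= valid:
                found.add(window)
    return found


# Hypothetical toy sequences (not real 16S rRNA data); N marks an undetermined base.
toy = ["AUGGCUAACGGCUNACGA", "AUGGCUAUCGGCUAACGA"]
for n in (2, 5, 9):
    print(n, count_possible(n), len(observed_oligos(toy, n)))
```

For short lengths the observed set can approach the full 4^n, while for longer lengths it is bounded by the number of windows in the data, which is the sequence context constraint described above.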

Signature oligonucleotides in 16S rRNA sequence dataset 


At a branch node in the phylogenetic tree, if an oligonucleotide gives a quality index value greater than a
preset value, this oligonucleotide is said to be a signature at that branch node since it can identify that node
better than other oligonucleotides which have a lower value of the quality index. In the current system, 0.6
is the cutoff value, i.e. only oligomers with function value over 0.6 at a branch node will be presented in the
results.






Of course, several signatures may identify a branch node and an oligonucleotide may also be a signature
simultaneously at several branch nodes. Clearly, the higher the quality index value of a signature at a
branch node is, the better it can identify that node. A signature with a function value of 0.8 is better than
one with a function value of 0.6 at the same branch node, and a signature with function value 1.0 is perfect
for that node, which, according to the definition of the signature quality function, Qs, means that all 16S
rRNAs having this signature sequence are in the same phylogenetic group defined by that branch node and
thus no 16S rRNAs with the same signature are outside that group.
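
As a hedged numerical illustration of these quality levels, the sketch below evaluates a quality index for one oligonucleotide at one branch node. It assumes the form of the function recited in claim 3, read here as Qs = (N_GM / N_G) x (1 - (N_M - N_GM) / N_M); the function and argument names are hypothetical and the counts are invented for illustration.

```python
def signature_quality(n_gm, n_g, n_m):
    """Quality index Qs for one oligonucleotide at one branch node.

    n_gm: probe-matched organisms inside the group under the node
    n_g:  organisms in the group
    n_m:  probe-matched organisms in the entire tree
    """
    if n_m == 0 or n_g == 0:
        return 0.0
    coverage = n_gm / n_g                  # penalizes false negatives
    specificity = 1 - (n_m - n_gm) / n_m   # penalizes false positives
    return coverage * specificity


def quality_level(q, cutoff=0.6):
    """Label a value the way the text reports results (cutoff 0.6 per the text)."""
    if q >= 1.0:
        return "perfect"
    if q > cutoff:
        return "signature"
    return "not reported"


# Invented counts: a group of 5 organisms.
print(quality_level(signature_quality(5, 5, 5)))  # every member matched, no outsiders
print(quality_level(signature_quality(4, 5, 5)))  # one member missed, one outsider matched
print(quality_level(signature_quality(2, 5, 4)))  # weak coverage and specificity
```

Under this reading, a value of 1.0 requires that the matched organisms and the group coincide exactly, which agrees with the description of a perfect signature above.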

Signatures of different lengths are distributed in the phylogenetic tree differently. The general observation
is that long and short signatures have polar distributions in the tree: the long signatures tend to identify the
branch nodes near the tree leaves while the short ones are more likely to pick out those near the tree root.
This trend is evident when the results of pentameric and undecameric signatures are compared. The result
shows that 35 out of 35 (100%) perfect (Qs = 1.0) pentameric signatures identify the root while 11,958 out
of 18,746 (64%) perfect undecameric signatures identify the two-leaves-as-two-children branches.


Short signatures, e.g. the pentamers and hexamers examined by the system, are generally too unspecific to
identify any interesting small groups in the phylogenetic tree as judged by Qs. They tend to identify the
whole bacterial tree instead. If [...] is used, then sequences of this length might be significant. On the
other hand, long signatures, e.g. undecameric and longer oligonucleotides, are increasingly specific and
mainly identify individual organisms and two-leaves-as-two-children groups. Signatures with a length
between seven and eleven should have a more balanced distribution in the phylogenetic tree.

2,533 nonameric signatures can identify phylogenetic groups with three or more (up to 23) members
perfectly. At the >0.8 and >0.6 quality levels there are 5,5[..] and 15,34[.] nonameric signatures,
respectively. At this length, the signature sequences cover/identify about 80% of the phylogenetic groups
in the representative tree. The user can refer to Table E for a quick comparison.






In Table E a "gap" between the numbers of signatures shorter than octamers and those longer than
heptamers is evident. On every level of signature quality, where Qs is equal to 1.0, 0.8,
or 0.6, there is a sharp, unexpected increase in the number of signatures and tree coverage from heptamers
to octamers.

Table E provides a comparison among signatures of various lengths ranging from pentamers to undecamers
and also 15-mers. Only signature sequences that can identify phylogenetic groups with three or more
members are counted in constructing this table. A computer program is used to calculate the coverage. Any
branch nodes other than those that have two leaf nodes as their two child nodes in the representative tree
are regarded as phylogenetic groups (635 in total). The signature quality Qs is greater than 0.6.

Illustrative Examples

Example 1. A Local Region of the Tree and Its Associated Signatures

The purpose of this example is to better illustrate, at a more detailed level, the relationship between the
signature sequences found and the nodes of the tree used. Table F lists some of the results with reference
to a local region of the comprehensive tree. Before trimming, this region contained 16S rRNAs representing
38 organisms. A total of 23 of these sequences were of the very highest quality, but many of them were very
similar, so a total of 12 sequences were selected for final inclusion in the representative tree. This local
region of the representative tree is shown in Figure 12. The numbers of nonameric, undecameric and
15-mer signature sequences at each of the 11 branch tree nodes in this 12 organism subtree in different
ranges of quality levels are summarized in Table F. [...] signatures at the Qs 1.0 level whereas its parent
branch, node 5549, has 14 perfect nonameric/undecameric/15-mer signatures. Several of these are the same
sequences, which serve as signatures of [...] Qs at the 0.8 level. This result draws attention to the fact that
many individual oligonucleotides are signatures of several branch nodes at differing levels of Qs, [...]
between nodes. The signatures identifying the taxonomical group represented by the local root node 5577
of the representative tree illustrate another common feature. Of the 17 perfect signatures for node 5577,
five are nonameric, six are undecameric and six are 15-mers. However, every one of these five nonameric
signatures appears as a part of one of the six undecameric signatures. This inclusion of shorter signature
sequences as a part of a longer one is frequently seen regardless of the signature length, the signature
quality level and the position of interest in the phylogenetic tree.
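
The inclusion of shorter signatures within longer ones can be checked mechanically. The following is a minimal sketch, assuming signatures are held as plain strings; the function name and the example sequences are hypothetical and are not the actual signatures of node 5577.

```python
def contained_in(short_sigs, long_sigs):
    """Map each shorter signature to the longer signatures that contain it."""
    return {s: [l for l in long_sigs if s in l] for s in short_sigs}


# Hypothetical nonamers and undecamers for a single node.
nonamers = ["ACGUACGUA", "GGCUAACGG"]
undecamers = ["UACGUACGUAG", "CCAUGGAUCCA"]

nested = contained_in(nonamers, undecamers)
for short, longer in nested.items():
    print(short, "->", longer if longer else "not part of a longer signature")
```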

Example 2 

In silico hybridization 

Once the characteristic oligonucleotides (signature sequences) from the 16S rRNA sequence dataset are
identified, they can be used to implement in silico hybridization. (This is not carried out in the laboratory.
Instead, it is performed virtually by a computer program, thus "in silico".) It can be either
executed as a standard experimental routine or, in this case, as a quick test of the validity of the signatures
which have been identified.

Since these characteristic oligonucleotides were derived from the selected valid 16S rRNA sequences using
the corresponding representative tree, several valid 16S rRNAs that were not selected to make the
representative tree were chosen as 16S rRNAs from "unidentified" bacteria. Program hybridize was used to
perform in silico hybridization between the unknown 16S rRNAs and the characteristic oligonucleotides.
The unknowns were thus placed in their predicted phylogenetic neighborhoods in the representative tree.
Because the comprehensive phylogenetic tree is available, the validity of the predictions could be
quickly and definitively checked.

This in silico hybridization experiment was set up with the following parameters:

Probe length: 9 (nonameric) and 11 (undecameric)
Quality level: 0.6, 0.8, and 1.0
16S rRNA control: Escherichia coli (E.coli)
Tests done with the following valid sequences:
Methanobacterium formicicum (Mb.formici)
Tetragenococcus halophilus (Tgc.halop2)
Orientia tsutsugamushi (Ort.tsuts6)
Test done with the following invalid sequence:
the isolate M2 of the symbiont of methanogen (sym.M2)

The four agents in this example are chosen in a random way with maximum distribution in the
comprehensive tree.






The results of this example are very promising. All five bacteria, namely one control and four test
organisms, are placed in the correct phylogenetic neighborhoods. The correctness of the placements is
confirmed by the positions of those five organisms in the comprehensive tree.






The control, E. coli at leaf node 7270 under branch node 7224 in the comprehensive tree, is unambiguously
placed under branch node 7259 with E.coli (itself) and two other E. coli entries at three leaf nodes when
probes at Qs 1.0 are used. The best example of the four cases is probably Ort.tsuts6, which resides at leaf
node 5404 under branch node [...] in the comprehensive tree and is placed under
branch node 5391 with Ort.tsuts9 at the only direct leaf node 5411 of this branch node. Another particularly
noteworthy and interesting case is the identification of sym.M2. The sequence of the 16S rRNA from this
organism has only 359 nucleotides with one undetermined position. The correct placement of this
prokaryote in the representative tree was possible because some signature sequences in its poorly
sequenced 16S rRNA apparently remained intact and identifiable.


Although the prokaryotic organisms could be placed in correct clusters, there were positive errors, i.e. some
groups, which are not in the correct phylogenetic neighborhoods, were positively identified. This kind of
error occurs because many of the signature sequences used have a value of Qs of less than 1. The number
of these false positive errors decreased as the probe quality Qs increased from 0.6 to 1.0, but for a specific
organism and a specific probe quality level there was little difference in the error rate between using
nonameric and undecameric probes. Despite this imperfection, one point should be stressed: even though
the false positives occur, the correct phylogenetic neighborhoods are among the groups identified in all
cases. Moreover, the correct neighborhood is readily identified by the presence of multiple hits whereas the
noise placements are frequently lone hits. This is a very important aspect of the method, which stems
directly from the parent/child relationship between nodes in a bifurcating tree. Thus, false positives are not
a serious impediment to success. False negatives are also not a problem because of the redundancy of
signature sequences that occur at many nodes.
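
The idea that the correct neighborhood shows multiple hits while noise placements tend to be lone hits can be sketched as follows. This is not the program hybridize mentioned above, whose implementation is not given here; it is a minimal stand-in that assumes exact, same-sense substring matching, and the node labels, probes and target sequence are hypothetical.

```python
from collections import Counter


def in_silico_hybridize(target, probes_by_node):
    """Count, per branch node, how many signature probes occur in the target.

    target: sequence of the unknown organism
    probes_by_node: dict mapping node id -> list of signature oligonucleotides
    Only exact matches in the same sense are counted; real hybridization
    (complements, mismatches) is not modeled.
    """
    hits = Counter()
    for node, probes in probes_by_node.items():
        for probe in probes:
            if probe in target:
                hits[node] += 1
    return hits


def likely_neighborhoods(hits, min_hits=2):
    """Keep nodes supported by multiple hits; lone hits are treated as noise."""
    return sorted(node for node, n in hits.items() if n >= min_hits)


# Hypothetical unknown sequence and signature probes for three nodes.
unknown = "GGAUCCGAUUACGCAUGC"
probes = {
    "A": ["GGAUCC", "GAUUAC", "AAAAAA"],
    "B": ["CGCAUG", "UUUUUU"],
    "C": ["CCCCCC"],
}
hits = in_silico_hybridize(unknown, probes)
print(dict(hits), likely_neighborhoods(hits))
```

The `min_hits` threshold is an illustrative choice; the text only states that correct placements show multiple hits while noise placements are frequently isolated.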

This example shows that when a small set of 16S rRNA sequences are analyzed, at least some signature
sequences exist that are representative of the phylogenetic groups that can be identified by tree
constructions based on the complete 16S rRNA sequences. The consequence of having thousands of such
sequences in the dataset was not known in the prior art. Possibly noise would build up to the extent that
useful signatures would be obscured. Even if such sequences continued to exist in the larger data set, it was
not clear that their numbers would be sufficient, nor was it clear that they could be identified.

The results establish beyond any doubt that characteristic oligonucleotides in the bacterial 16S rRNA
sequence dataset do in fact exist in huge numbers. Over 15,000 nonamers alone were identified, with in
many cases multiple coverage of the various phylogenetic groupings in the 92? organism representative
tree.

It is invaluable to identify these signature sequences because a group of evolutionarily related bacteria can
be distinguished from other groups by a set of characteristic oligonucleotides specific to that group. The
existence of these signatures is a direct demonstration of an innate characteristic of the evolution of
bacterial 16S rRNAs that can be utilized to identify an unknown prokaryotic agent by elucidating its
immediate phylogenetic neighborhood. These characteristic oligonucleotides can be used as the basis for
developing hybridization probes that can be used in order to design valuable oligonucleotide microarrays.
Herein the utility of the signature sequences was tested by in silico hybridizations, using as unknowns
sequences that had not been included in the original representative tree. These studies demonstrated that the
characteristic oligonucleotides in the unknown organisms readily provided their correct placement in the
tree.

This example by no means limits the invention to characteristic oligonucleotides in the 16S rRNA sequence
dataset. On the contrary, it encompasses many variations and specific improvements including, but not
limited to, the following:

1. Use of new data available at RDP (both 16S rRNA sequences of release 8.1 and updated prokaryotic
phylogenetic trees).

2. Improvements to the representative tree so that every group of prokaryotes in the
comprehensive tree is represented by at least one bacterium in this tree. Where possible, merging of pairs of
two closely related but not [...] of that tree region
may be possible. It also may be useful to better weight the number of entries from various clusters.





3. Use of different but sensible functions to calculate the signature quality index. Since the quality index is
the most important tool for evaluating the signature potential of oligonucleotides in this system, changing
the function can have a substantial impact on the specific result.

4. Assembling and use of a representative set of characteristic oligonucleotides by which the majority of
the groups and all of the important groups in the representative tree can be identified. The oligonucleotides
in this set are likely to have various lengths.

5. Applying mathematical and programming techniques to facilitate the final interpretation of hybridization
results.


Example 3 - Soil Samples

16S rRNA is purified from an unknown organism isolated from soil and amplified by RT-PCR using
primers directed to conserved regions and flanking a variable region of the molecule. The PCR products
are subjected to digestion by a restriction endonuclease, fluorescently labeled with Cy5, and then hybridized
to an array of all possible 8-mer peptide nucleic acids. After washing, the pattern of hybridization is
observed by confocal laser fluorescence scanning and interpreted in terms of the known signature
sequences for bacteria, and the organism is assigned to the genus Nocardia.

Example 4 - Soil Samples

16S rRNA is purified from an unknown organism isolated from soil and amplified by RT-PCR using
primers directed to conserved regions and flanking a variable region of the molecule. The PCR products
are subjected to digestion by a restriction endonuclease, fluorescently labeled, and then hybridized
to an array of 5,000 DNA probes designed to recognize the 16S rRNA sequences of particular species.
After washing, the pattern of hybridization is observed by confocal laser fluorescence scanning, and no
significant hybridization is found. The same labeled nucleic acids are then hybridized to an array of 4,000
probes to bacterial signature sequences identified by the methods of this invention. After washing, the
pattern of hybridization is observed by confocal laser fluorescence scanning and interpreted in terms of the
known signature sequences for bacteria, and the organism is assigned to the genus Bacillus.

Example 5 - Air sample

Nucleic acids isolated from an air filtrate are aliquoted into 50 wells of a fluorescence microtiter plate, each
well containing a 5'-FITC, 3'-quencher molecular beacon hairpin probe specific for a selected signature
sequence. After heating to 95 C for 5 minutes, the plate is allowed to cool slowly to room temperature, and
fluorescence is read. The pattern of fluorescence is compatible with the presence of a strain of
Staphylococcus that is closely related to a known pathogenic strain.

Example 6 - Mutated Protease

Nucleic acids of a virus are isolated and amplified from a blood sample and signature sequences are scored
using the Qiagen Genomics Masscode sequence detection technology. The presence of particular signature
sequences permits identification of a strain bearing a mutation of a previously known protease, which
confers on it resistance to particular therapeutic drugs.


Example 7 - Meat sample

Nucleic acids are isolated from a meat sample claimed to be goose liver and signature sequences are scored
using the Third Wave Technologies Invader directed-cleavage assay. The presence of a particular signature
sequence indicates the presence of turkey meat as an adulterant.

Example 8 - Blood sample

Blood taken from the bed of a pickup truck owned by a suspected poacher is analyzed for signature
sequences of mammalian mitochondrial DNA using individual hybridization assays detected by
chemiluminescence produced by an alkaline-phosphatase-conjugated RNA/DNA-specific antibody. The
results suggest the blood comes from an animal of the genus Euarctos, and the suspect is arrested on
suspicion of poaching the American black bear.


Example 9 - Air sample

Nucleic acids isolated from an air filtrate are aliquoted into wells of a fluorescence microtiter plate, each
well containing a 5'-FITC, 3'-quencher molecular beacon hairpin probe specific for a selected 18S rRNA
signature sequence. After heating to 95 C for 5 minutes, the plate is allowed to cool slowly to room
temperature, and fluorescence is read. The pattern of fluorescence is compatible with the presence of both
a mold belonging to the genus Stachybotrys and a fungus belonging to the genus Aspergillus. Two DNA
oligonucleotides (one 5'-biotinylated) corresponding to two signature sequences found in the sample are
used in a PCR reaction to amplify a segment (of predicted length 46 nucleotides, based on the positions of
the signature sequences in the 16S rRNA sequence) of rDNA. The biotinylated product is immobilized in
single-stranded form and used as a probe for high-affinity, high-specificity detection of a novel species of
Stachybotrys.



Example 10

Nucleic acids of a virus are isolated and amplified from a blood sample and signature nucleic acid
sequences are scored using the Qiagen Genomics Masscode sequence detection technology. Eight
signature enzyme activities are also assayed for, and two are found, and 24 proteins whose presence can
serve as signatures are assayed for by ELISA, and two are detected. The combined presence of particular
signature sequences, activities, and proteins permits identification of the viral strain.

Example 11 - Bioterrorism

Air filtrate from a government building is collected and nucleic acids isolated. RNA is enriched using
DNAse and RNA fragmented by heating. Probes specific to several known bioterrorism agents give
negative results. Molecular beacon-based scoring of signature sequences reveals the presence of
unexpectedly high concentrations of bacteria with genetic affinity to the genus Bacillus. Further
investigation reveals an engineered strain of Bacillus anthracis, and the building is evacuated. It is noted
that the prior art known to Applicants would fail to identify this engineered strain.

MODIFICATIONS

Specific compositions, methods, or embodiments discussed are intended to be only illustrative of the
invention disclosed by this specification. Variations on these compositions, methods, or embodiments are
readily apparent to a person of skill in the art based upon the teachings of this specification and are
therefore intended to be included as part of the inventions disclosed herein. Particularly preferred species
and ranges of parameters are partially summarized by Table G.






The nucleic acid sequences included in the database can be any ribosomal RNA, or a fragment thereof, or
DNA encoding ribosomal RNA or a fragment thereof, or the DNA spacer region between rRNA genes; or
either the genomic DNA or RNA of viruses, or artificial RNAs, or any functional RNA molecule such as
RNAse P RNA that is found in a useful variety of organisms. The molecule actually detected may be one
that has a sequence related to the molecule represented in the database, for example PCR, NASBA or
RT-PCR products derived from rRNA or rDNA.

Once identified, signature sequences will preferably be used in the design of hybridization probes. In this
regard, the set of unique sequences of various lengths are perfect signatures for the specific organism that
they are found in and therefore are obvious candidates for use in the design of specific hybridization probes
for that organism. If a node is associated with multiple signature sequences, as many are in the case of 16S
rRNA, it will be preferable to utilize the one or more with the most favorable hybridization properties.
Depending on the [...], the detecting probe can preferably incorporate a portion or all of either
a particular signature sequence or its complement. There are also obvious mathematical relationships
between the signature sequences of different lengths. Thus, for example, a 16 base signature sequence that
is perfect for node N will necessarily show up in the 8-mer signature set as 9 different unique signature
sequences for node N (each representing [...] in the 16-mer). Therefore, one will be able to combine
signature sequences in some cases to serve as a starting point in the design of longer probes. Many
signature sequences that lack the type of relationship described above may still be sufficiently near each
other in the primary sequence that it will be possible to combine them to design a longer probe. This can be
done by including a "wildcard" hybridization base such as inosine at certain positions. More generally, a
variety of non-standard bases can be used to [...]. Also, the properties of a signature sequence can be
modified to adapt them for use with organisms represented by another node. [...] sequences can be
modified to facilitate hybridization or detection. This includes but is not restricted to incorporation of
fluorophores, chemically labile moieties, isotopes, or halogen atoms. Modifications can be incorporated in
the course of replication by DNA polymerase or RNA polymerase. Labels can be incorporated in the course
of PCR, RT-PCR or NASBA.
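
The arithmetic behind the 16-mer example (16 - 8 + 1 = 9 octamer windows) and the idea of combining overlapping signatures into a longer probe can be sketched as below. The function names and the 16-base sequence are hypothetical. The sketch only enumerates and joins windows; whether each window actually qualifies as a signature still depends on the dataset and the quality index.

```python
def sub_oligos(signature, k):
    """All length-k windows of a longer signature, in order of position."""
    return [signature[i:i + k] for i in range(len(signature) - k + 1)]


def merge_overlapping(left, right):
    """Join two signatures if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None when they do not overlap.
    """
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# Hypothetical 16-base signature.
sixteen_mer = "ACGUUGCAAGCUUAGC"
octamers_in_16mer = sub_oligos(sixteen_mer, 8)
print(len(octamers_in_16mer))                      # number of 8-base windows
print(merge_overlapping(octamers_in_16mer[0], octamers_in_16mer[1]))
```

Merging the nine windows back in order reproduces the original 16-mer, which is the sense in which shorter signatures can serve as a starting point for a longer probe.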


Detection can employ a variety of known [...] hybridization
and otherwise. Hybridization can be to RNA or DNA, but also to peptide nucleic acids, locked nucleic
acids, branched nucleic [...] nucleic
acids. Array formats (on single or multiple, e.g., bead supports) will often be valuable. Hybridization can
lead to the capture of a labeled nucleic acid [...] or array. Labels
can be isotopes, chemically-detectable tags, liquid crystals, cleavable chemical tags, fluors, quantum dots,
or enzymes such as alkaline phosphatase, ribozymes, or peroxidase. Enzymes can produce heat, color,
fluorescence, chemiluminescence, precipitates, bioluminescence, changes in liquid crystalline order, or
changes in nucleic acid structure. Hybridization can also lead to production of signals by self-quenching
probes such as molecular beacons, or by ribozyme activation, FRET pairs, or changes in plasmon
resonance or similar interfacial optical phenomena, in mechanical resonant frequency, in redox activity or
electrical conductivity, in electrophoretic or chromatographic mobility, in affinity for chelated metals,
minerals, or antibodies or proteins, or in particle or molecular mobility. Robotic methods of preparation and
microtiter plates can be employed with the invention to further automate multiple assays.



The method of the invention is especially useful when the hybridization probes consist of every possible
sequence of one length. For example, there are 65,536 unique octamers. The signature
characteristics of every one of these octamers are obtained by the method of the invention for any nucleic
acid of interest. When the nucleic acid being used is 16S rRNA or 16S rDNA, the same array can be used
for any bacterial identification [...] as there will be
conflicting signatures. Only the sample preparation procedure would differ. The same array can also be
used with any other nucleic acid. Hence by changing [...] strand genomic RNA
of the flavivirus family, the experimental results would be useful in identifying the closest known genetic
relatives of the test virus in this virus group. It is an important aspect of the invention that it is not necessary
that all the oligomers in the array work properly. There is frequently a high redundancy of signature
sequences associated with a particular node so that if several fail, the node will still give a signal if it is
represented in the sample.
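
A complete octamer array and the redundancy argument can be illustrated with a small sketch. The names below are hypothetical, DNA letters are assumed for the probes, and a probe is treated as either working or failed, which is a simplification of real array behavior.

```python
from itertools import product


def all_oligos(n, alphabet="ACGT"):
    """Yield every possible oligonucleotide of length n in lexicographic order."""
    for letters in product(alphabet, repeat=n):
        yield "".join(letters)


# Every possible octamer, with a position on a hypothetical array.
octamers = list(all_oligos(8))
index_of = {oligo: i for i, oligo in enumerate(octamers)}


def node_detected(node_probes, working_probes, sample_oligos):
    """A node still gives a signal if any working signature probe finds its target."""
    return any(p in working_probes and p in sample_oligos for p in node_probes)


# Hypothetical node with three signature octamers; one probe has failed.
node_probes = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG"]
working = {"CCCCCCCC", "GGGGGGGG"}
sample = {"AAAAAAAA", "CCCCCCCC"}
print(len(octamers), node_detected(node_probes, working, sample))
```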



Although signature sequences will preferably be used in conjunction with hybridization methods of
various types, it should be noted that these sequences also have unique physical properties. Therefore, if a
plurality of signature sequences are generated by experimental means, e.g. by digestion with ribonuclease
T1 or a restriction endonuclease, these physical properties can be measured. Mass spectrometry, which can
comprise matrix-assisted laser desorption [...], or resonance
methods can be used to determine mass within 10%, more preferably 2% and most preferably 1% for each
sequence. Likewise, applications exist [...] in the design of PCR primers
to amplify larger regions of DNA or RNA. For example, a completely unknown organism is detected by the
method of the invention and best assigned to a large early branching group. The probes that detected this
affiliation could then be used as amplification primers to readily obtain a large region for full sequencing or
as a longer probe.



Although the invention is preferred for use with functional nucleic acids, it can also be used with DNA
sequences such as genes that encode protein. In this case, a database of genes for the equivalent protein
from a sufficient number and variety of organisms or viruses would be needed. The tree used might be
deduced from the genes themselves, but in order to avoid possible complications of lateral gene transfer it is
preferable to use a tree based on 16S rRNA sequence data.

When the invention is used with viruses, it is necessary to appreciate that all viruses do not share a single
common ancestor. There are many distinct groups of viruses, some with DNA genomes and some with RNA
genomes, e.g. the Flaviviridae, which is a large family of single stranded positive sense RNA viruses that
includes the causative agents of yellow fever, St. Louis encephalitis, Japanese encephalitis, hepatitis C, and
Dengue fever. The genome is typically in the size range 9,500-12,500 nucleotides. Several common genes
exist and hence meaningful phylogenetic trees can be developed which span the entire group. Thus, it is
possible to generate signature sequences that are specific for Dengue serotype II or Dengue in general,
etc. The methods of the invention can be used for any virus group as long as a meaningful tree can be
produced. However, the sample preparation may require more steps. The different types of nucleic acid
involved (single strand positive sense RNA, double stranded DNA, etc.) may limit the number of virus
groups that can be detected in one experiment.

Features preferred with the invention in [...] cases comprise: the nucleic acid is DNA that encodes
ribosomal RNA or a fragment or a complementary sequence of the foregoing; the nucleic acid is RNA
complementary to one of the strands of the DNA that is in the spacer region between ribosomal RNA genes
or a fragment of the foregoing; the nucleic acid is DNA isolated from the spacer region between ribosomal
RNA genes or a fragment of the foregoing; the nucleic acid is any non-mRNA produced by the cell or a
fragment of the foregoing; the nucleic acid is any mRNA produced by the cell or a fragment of the
foregoing; the nucleic acid is genomic DNA or a fragment of the foregoing; the signature quality index Qs
includes terms that weight against false positives and false negatives; the tree contains some multiple
branchings but is substantially bifurcating; the genetic affinity of bacteria or eukaryotic organisms is
determined; the genetic affinity of more than one bacterial or eukaryotic organism can be determined in a
single experiment; wherein the nucleic acid is DNA that encodes ribosomal RNA or a fragment or a
complementary sequence of the foregoing; the nucleic acid is RNA complementary to one of the strands of
the DNA that is in the spacer region between ribosomal RNA genes, or a fragment of the foregoing; the
nucleic acid is DNA isolated from the spacer region between ribosomal RNA genes or a fragment of the
foregoing; where the nucleic acid is any non-mRNA produced by the cell or a fragment of the foregoing.


Other preferred features comprise: the nucleic acid is any mRNA produced by the cell or a fragment of the
foregoing; the nucleic acid is genomic DNA or a fragment of the foregoing; the genetic affinity of more than
one virus can be determined in a single experiment; the nucleic acid is a ribosomal RNA or a fragment or
a complementary sequence of the foregoing; the nucleic acid is DNA that encodes ribosomal RNA or a
fragment or a complementary sequence of the foregoing; the nucleic acid is RNA complementary to one of
the strands of the DNA that is in the spacer region between ribosomal RNA genes or a fragment of the
foregoing; the nucleic acid is any non-mRNA produced by the cell or a fragment of the foregoing; the
nucleic acid is any mRNA produced by the cell or a fragment of the foregoing; the nucleic acid is genomic
DNA or a fragment of the foregoing; the signature probes are not all of the same length; the signature
probes represent signature genes; choosing a tree of relationships that can be reasonably expected to
signify genetic relationships [...] previously published or otherwise generated by a third party; the
hybridization probes are complementary or the same sense as the signature sequences; a plurality of
signature sequences is combined into one or more larger hybridization probes; a hybridization probe
incorporates a portion of the information in a signature sequence; the signature probes are comprised of a
nucleic acid analog comprising PNA, 2'-O-methyl-DNA or analog thereof; the presence or absence of a
signature sequence in a test sample is determined by physical characterization; the signature sequences are
identified by the method of claim 1; physical characterization is done with mass spectrometry; the nucleic
acid molecule is a DNA molecule; the DNA molecule is a cDNA molecule.

The invention may also be applicable in unexpected situations. For example, there are currently a large
number of genomes being completely sequenced. When one assembles phylogenetically meaningful
clusters of whole genome sequences, there are certain genes that are highly characteristic of particular
clusters of organisms. These signature genes can be used in the identification of unknown organisms,
preferably by detecting the presence of activities or gene products associated with the signature genes
rather than a nucleic acid assay.






What is claimed is: 






CLAIMS 

1. A method for obtaining signature probes comprising the steps of:

A. Compiling a database of nucleic acid sequences from a substantially homologous region of an
RNA or DNA comprising sequences from all organisms or viruses that will be incorporated into the
analysis;

B. Compiling a bifurcating tree that shows the genetic relationships between the organisms
whose nucleic acid sequences will be included in the analysis;

C. Calculating the occurrence frequency and distribution of every oligoribonucleotide or
oligodeoxyribonucleotide sequence of length N in the sequence database;

D. Calculating a signature quality function, which measures the extent to which each particular
oligoribonucleotide or oligodeoxyribonucleotide sequence of length N is characteristic of each node in a
substantially bifurcating, substantially phylogenetic tree of genetic relationships;

E. Selecting an oligoribonucleotide or oligodeoxyribonucleotide sequence as a signature for a
particular node if the quality index for said sequence has its greatest value for that node and the quality
index exceeds a preset value;

F. Synthesizing signature probes appropriate for use in a hybridization experiment that
incorporate the node-specific information of the signature sequences.

2. A method of claim 1 in which the signature quality index varies from 0.0 to 1.0 and the preset value is
chosen to be greater than 0.5.

3. A method of claim 1 in which the signature quality index Qs is calculated by substantially the equation:

\[ Q_s = \frac{N_{GM}}{N_{GT}} \times \left(1 - \frac{N_M - N_{GM}}{N_M}\right) = \frac{N_{GM}^2}{N_{GT} \times N_M} \]

in which N_M is the number of probe-matched organisms in the entire tree, N_GM is the number of
probe-matched organisms in the group of interest, and N_GT is the number of organisms in the group under
consideration.

4. A method of determining the genetic affinity of organisms or viruses in a test sample comprising the
steps of:

A. Deriving a plurality of nucleic acid signature probes from a database of signature sequences
that are able to hybridize to only a portion of the nucleic acid sequence of the organism or virus.

B. Hybridizing the signature probes to the nucleic acid obtained from the test sample under
hybridization conditions to cause those signature probes that are complementary to hybridize to
the nucleic acid of the organism or virus and produce a detectable signal.




C. Tabulating which signature probes produce a detectable hybridization signal.

D. Identifying the closest known genetic relatives of the organism or virus in the test sample by
determining which nodes in the bifurcating tree of genetic relationship that was used to design
the signature probes that produced the hybridization signal.

E. Identifying the organism or virus in the test sample as being contained within the most
terminal node that is supported by one or more;



5. A method of claim 4 wherein the signature probes are comprised of a moiety selected from the group
consisting of: RNA, DNA, an analog of RNA or DNA including peptide nucleic acids, 2-O-methyl DNA or
any other molecule that can interact with the test sample nucleic acid in a sequence-specific way.

6. A method of claim 4 wherein the hybridization step is selected from the group consisting
of: an immobilized array of signature probes, molecular beacons, hybridization step done in solution.

7. A method of claim 4 wherein the detection step utilizes radioactive labels, chemiluminescence and/or
fluorescence.

8. A method of claim 4 wherein a tree of relationships signifying genetic relationship is generated by a
standard method selected from the group consisting of parsimony methods, distance methods, and
maximum likelihood.

9. A method of claim 4 wherein the most narrowly defined groupings on the tree of relationship comprises
a moiety selected from the group consisting of: a specific genera, a specific species, a race, serotype, type
or other grouping below the species level.

10. A method of claim 4 in which the signature probes are constructed by the method of claim 1.

11. A method of devising oligonucleotide probes for use in hybridization comprising using the sequence
information provided in a signature sequence to construct the probe.

12. An isolated nucleic acid molecule comprising the sequence shown in Table B.

13. The RNA sequence CUGCAGAGAUGA or the corresponding DNA sequence, and probes
complementary to any of the foregoing or to sequences containing any of the foregoing, which are valuable
for identification of samples containing organisms with strong genetic affinity to Legionella nautarum.

14. The RNA sequence AAAAUCAUUCUC or the corresponding DNA sequence, and probes
complementary to any of the foregoing or to sequences containing any of the foregoing, which are valuable
for identification of samples containing organisms with strong genetic affinity to Listeria grayi.






15. The RNA sequence CGGGAGGCAGCAGCU or the corresponding DNA sequence, and probes
complementary to any of the foregoing or to sequences containing any of the foregoing, which are valuable
for identification of samples containing organisms selected from the group of genera consisting of Borrelia,
Brachyspira, Spirochaeta and Treponema.

16. The RNA sequence AUUAGAAACUGU or the corresponding DNA sequence, and probes
complementary to any of the foregoing or to sequences containing any of the foregoing, which are valuable
for identification of samples containing organisms with strong genetic affinity to Ureaplasma
canigenitalium.

17. The RNA sequences GGAGGAUGAAGGUUU and GGCGACCUGCUGGAA which are substantially
perfect signatures for node 4254 which contains various members of the genus Helicobacter and
GGCGUGCGAGCGUGG which is a substantially perfect signature for node 3634 which contains species
of Isosphaera.

18. An assay or test kit comprising an RNA sequence selected from the group consisting of
AAAAUCAUUCUC, CGGGAGGCAGCAGCU, AUUACAAACUGU, GGAGGAUGAAGGUUU and
GGCGACCUGCUGGAA or the corresponding DNA sequence, and probes complementary to any of the
foregoing or to sequences containing any of the foregoing.

19. A method of claim 4 in which the signature probes are of length 6 or larger and where the nucleic acid
is DNA isolated from the spacer region between ribosomal RNA genes or a fragment of the foregoing.

20. All inventions contained herein.
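
The counting step of claim 1 (step C) and the quality index of claim 3 can be illustrated with a minimal Python sketch. The function names `kmers_by_organism` and `signature_quality` and the toy sequences are assumptions made for illustration only; they are not the programs or data described in this application.

```python
from collections import defaultdict


def kmers_by_organism(sequences, n):
    """Map every length-n oligonucleotide to the set of organisms containing it."""
    hits = defaultdict(set)
    for organism, seq in sequences.items():
        for i in range(len(seq) - n + 1):
            hits[seq[i:i + n]].add(organism)
    return hits


def signature_quality(matched, group):
    """Qs = N_GM^2 / (N_GT * N_M) for one oligonucleotide and one tree node.

    matched: organisms in the entire tree that contain the oligonucleotide (N_M)
    group:   organisms under the node being considered (N_GT)
    """
    n_m = len(matched)
    n_gm = len(matched & group)   # probe-matched organisms inside the group
    n_gt = len(group)
    if n_m == 0 or n_gt == 0:
        return 0.0
    return (n_gm * n_gm) / (n_gt * n_m)


# Toy sequences invented for this example (not taken from the 16S rRNA data set).
toy = {"org1": "AAAUUGAAG", "org2": "AAAUUGCAG", "org3": "GGGUUGCAG"}
hits = kmers_by_organism(toy, 4)
group = {"org1", "org2"}          # hypothetical node containing org1 and org2
q_perfect = signature_quality(hits["AAAU"], group)
q_partial = signature_quality(hits["UUGC"], group)
```

Here "AAAU" occurs in every member of the group and nowhere else, so it scores as a perfect signature, while "UUGC" matches one organism inside the group and one outside it and scores lower.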







Figure 1 

The bi-directional binary tree structure with three leaf nodes. Note that a parent node has pointers to its
child nodes and each child node has a pointer back to its parent.
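
A bi-directional binary tree of this kind can be sketched in Python as follows. The class name `Node` and its methods are hypothetical and serve only to illustrate the parent and child pointers described in the caption; the original programs are not reproduced here.

```python
class Node:
    """A binary tree node that knows both its children and its parent."""

    def __init__(self, name=None):
        self.name = name
        self.parent = None
        self.left = None
        self.right = None

    def add_children(self, left, right):
        # Set the downward pointers and the pointers back to the parent.
        self.left, self.right = left, right
        left.parent = self
        right.parent = self

    def is_leaf(self):
        return self.left is None and self.right is None


def leaves(node):
    """Collect leaf names from left to right."""
    if node.is_leaf():
        return [node.name]
    return leaves(node.left) + leaves(node.right)


# A tree with three leaf nodes, as in the figure.
root = Node("root")
inner = Node("inner")
a, b, c = Node("A"), Node("B"), Node("C")
inner.add_children(a, b)
root.add_children(inner, c)
```

The back pointer lets a program walk from any leaf up to the root, which is convenient when collecting all nodes that contain a given organism.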







Figure 2

The elaborate structure of the composite hash used in the program. Only two entries are shown in this
figure. A hash is represented by a table and the keys are shaded. 0 denotes the data type "undef" in Perl.
The data in this hash are for elucidatory purposes only.







Figure 3 

The flow chart of the principal programs used.







Figure 3 (continued) 






E.colirnA3.gb

LOCUS       E.colirnA3   3714 bp   RNA   RNA   09-NOV-1998
DEFINITION  Escherichia coli str. MG1655 [gene=rrsA gene].
REFERENCE   1
AUTHORS     Blattner,F.R., Plunkett,G.,III, Bloch,C.A., Perna,N.T., Burland,V.,
            Riley,M., Collado-Vides,J., Glasner,J.D., Rode,C.K., Mayhew,G.F.,
            Gregor,J., Davis,N.W., Kirkpatrick,H.A., Goeden,M.A., Rose,D.J.,
            Mau,B. and Shao,Y.
TITLE       The complete genome sequence of Escherichia coli K-12
JOURNAL     Science 277 (5331), 1453-1474 (1997)
COMMENT     Corresponding GenBank entry: U00096
            operon= rrsA gene
            isolate_name= MG1655
ORIGIN
        1 AAAUUGA A-GAGUU-U- GA-U-CAU-G ...
     3541 ... GUAGG-GGA -A-CCUG--C GGU--UG-GA UCACCUCCUU A
     3661

readseq

E.colirnA3.fasta

>E.colirnA3 3714 bp RNA RNA 09-NOV-1998, 3714 bases, 1504 checksum.
AAAUUGAA-GAGUU-U-
GA-U-CAU-G ...
GUAGG-GGA-A-CCUG--CGGU--UG-GAUCACCUCCUUA

fasta2flat

E.colirnA3
AAAUUGAAGAGUUUGAUCAUG ... GUAGGGGAACCUGCGGUUGGAUCACCUCCUUA

Figure 4 

Subsystem I converts the format of the sequence file as shown schematically above.
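
The second conversion step (an aligned FASTA record to an ungapped name and sequence pair) might look like the Python sketch below. The function `fasta_to_flat` is a hypothetical stand-in, not the fasta2flat program itself, and the exact layout of the flat file is an assumption: here each record is reduced to its first header token and its sequence with the alignment gap characters "-" and "." removed.

```python
def fasta_to_flat(fasta_text):
    """Return (name, ungapped_sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            # Drop alignment gap characters so only residues remain.
            chunks.append(line.replace("-", "").replace(".", ""))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Illustrative record; the header text is an assumption for this example.
example = ">E.coli 3714 bp RNA\nAAAUUGAA-GAGUU-U-\nGA-U-CAU-G\n"
flat = fasta_to_flat(example)
```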




Figure 5 

A phylogenetic tree and its corresponding Newick format presentation.






(a) 

((A,B),((((C,D),E),(((F,G),H),I)),(J,K)))




Figure 6 

Schematic illustration of how a tree file (shown in part a of the figure) in Newick format is parsed in a
stepwise and bottom-up fashion (part b of the figure).
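
One way to parse a Newick string stepwise and from the bottom up is to repeatedly replace the innermost parenthesized group with a new internal node label. The Python sketch below takes that approach; it is an assumed illustration rather than the tree_parser program, and it handles only plain leaf names without branch lengths, quotes, or nested parentheses inside labels. The "#k" node labels are hypothetical.

```python
import re

# Matches a parenthesized group that contains no further parentheses.
INNERMOST = re.compile(r"\(([^()]+)\)")


def parse_newick_bottom_up(newick):
    """Return (root_label, children) where children maps node label -> child labels."""
    text = newick.strip().rstrip(";")
    children = {}
    counter = 0
    while True:
        match = INNERMOST.search(text)
        if match is None:
            break
        counter += 1
        node_id = f"#{counter}"
        children[node_id] = [c.strip() for c in match.group(1).split(",")]
        # Collapse the group into a single label and continue one level up.
        text = text[:match.start()] + node_id + text[match.end():]
    return text, children


root_label, tree = parse_newick_bottom_up("((A,B),(C,D));")
```

Each pass of the loop corresponds to one step of the figure: the deepest remaining clade is resolved first, and the root is the last label produced.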







[Tree diagrams showing successive trimming steps, with panels including "Cut leaf (A)", "Cut leaf (B)"
and "Cut leaf (F)". Legend: filled circle = valid leaf node; open circle = invalid leaf node.]



Figure 7 

Schematic illustration of the trimming process that shows how it is stepwise and topology-conserving.
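
A topology-conserving trim can be sketched in Python by cutting every invalid leaf and then collapsing any branch node left with a single child, so that the remaining leaves keep their relative branching order. The function `trim` and the nested-tuple tree representation are assumptions for illustration; the recursion below produces the end result of the stepwise process rather than reproducing each intermediate step.

```python
def trim(tree, valid):
    """Trim a bifurcating tree to the leaves named in `valid`.

    tree:  a leaf name (str) or a (left, right) tuple
    valid: set of leaf names to keep
    Returns the trimmed tree, or None if no valid leaf remains.
    """
    if isinstance(tree, str):
        return tree if tree in valid else None
    left = trim(tree[0], valid)
    right = trim(tree[1], valid)
    if left is None:
        return right      # the surviving sibling takes the parent's place
    if right is None:
        return left
    return (left, right)


# Hypothetical five-leaf tree in which leaves A and D are invalid.
full = (("A", "B"), (("C", "D"), "E"))
trimmed = trim(full, {"B", "C", "E"})
```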







Figure 9 

The number of oligonucleotides and the length are related. (a) The number of all possible oligonucleotides
increases exponentially with the length. The curve is described by the function f(n) = 4^n. (b) The numbers of
the total and multi-occurring oligonucleotides in the 16S rRNA sequence dataset also increase with the
length. The increases are slower than that in (a), probably due to sequence constraints imposed by 16S rRNA
structure and function.




Figure 10 The representative prokaryotic phylogenetic tree
in Newick format

(((((((((C<&ter.bark^^^ str. 227 DSM 1538' : 0.13226 ,/<^p.hungat> 

Methanospirillum hungatei str. JF1 DSM 864 (T)' : 0.16948 ) : 0.24421 , '<H£ volcani> Haloferax volcanii 
str. DS-2 ATCC 29605 (Ty : 0. 03648 ) : 0.09112 , ('<env.SBAR16> Santa Barbara Channel 
bacterioplanktonDNA clone SBAR16 1 : 0.19448 , , <Tpl.acidop>Thermoplasmaacidophilum str. 122-1B2' 
: 0.22004): 0.04224>: 0.10775 , ^Argful®^ Archaec^lobus fulgidcs str. VC-l^DSM 4304 (Ty : 
0.04075 ) : 0.05544 , C < ^Mb.formici> Methanobacterium formicicum DSM 1312* : 0.03067 , , <Mtfervidl> 
Methanothermus fervidus' u0.19624>: &01978 >: 0.0947 , , <Taceler>Thermo<{occus celer str. VU 13 
DSM 2476 (Ty : 0.00981 ) : 0.05532 , C<Mc.vanniel> Methanococcus vannieiii str. EY33' : 0.02484 , 
'<Mc.jannasc> Methanococcm jannaschii str. JAL-1 DSM 2661 (Ty : 0.1614 ) : 0.0P857 ) : 0.02807 , 
, <Mpy.kandlI> Methanopyrus kandleri str. avrl9 DSM 6324 (Ty : 0.09845 ) : 0.02703 , C<env.pJP27> Mud 
Volcano area of YellowstaneJ&E (?'Black Pool 1 . 1 ) hotspring DNA clone pJP27 ! : 0.06783 , 
(C<env.SBAR12> Santa Barbara Channel bacterioplankton DNA clone SBAR12' : 0. 1046 , '<env.pJP&9> 
Mud Volcano area of YellowstonaNE.g'Black^ool") liot^ringXlNA clone pJ0P89 h : 0.28523 ) : 0.01132 , 
C<Tmf.penden> Thermofilum pendens str. Hw3 DSM 2475 (Ty : 0.04404 , C<Sul.acaida^ Suifolobus 
acidocaldarius str. 98-3 ATCC339Q9 <xy.:.O..O4024 y - t <Thptenax> Thermoproteiy tenax 1 : 0.15875 ) : 
0.02106 ) : 0.09273 ) : 0.20883 ) : 0.03789 ) : 0.31178 , C<Aqu.pyroph> Aquifex pyrophilus str. Kol5a' : 
0.20649 , (( , <Ttmaritim> Thp.rmotoga marittmastr MSB8-DSM.3109 (Ty : p. 01001 , '<Fer.island> 
Fervidobacterium islandicum str. H-21 DSM 5733 <T) ! : 0.16351 ) : 0.23062 , ((C<MeLruber4> 
Meiothermus ruber stE_Lc^LnavaJn. AXCC35948 (Ty . l .014908. ^'<Dradiodu£> Deinococcus radiodurans 
ATCC 35073' : 0. 19907 ) : 0.08298 , C<Cfx.aurant> Chloroflexus aurantiacus str. J-10-fl ATCC 29366 (T)' 
: 0.1976, , <a , mc.roseum>Thei^mic^^ 22502-(T)V : 0.36297 ) : 0.11213 ) : 0.01165 , 

((((((((((((('<Acp.laidla> JAl* : 0.11002 ,'<Cramosum> Clostridium ramosum 

str. 113-1 ATCC 25582£Ey l 03Q77.4 0.00736. ^<M.capricok> Mycoplasma-C^pricolum ATCC 27343 
(T) [^6=™^]* : 0.38452 ) : 0.10528 , '<Stc.therm3> Streptococcus tliermopliilusDSM 20617 (T)' : 
0.05073 ) : 0. 1 5065 .^KEcOvf aecal> Eiiterococci^faecalis!- : a0306>: a 01-73 a, ('<L.casei> LactobaciUns 
casei subsp. casei ATCC 393 (T)' : 0.13937 , ^L.delbruck^ Lactobacillus delbrueckii subsp. delbrueckii 
str. Calvert ATCC 9649 (T>- ^O.04&09.>:. GJ01832-) : 0.02217. ,. , <Lis.monoc3> Listeria monocytogenes* : 
0.02418 ) : 0.0404 , , <B.cereus4> Bacillus cereus IAM 12605 07 : 0.06989 ) : 0.0034 , C<B.subtilis> 
Bacillus subtilis str. 168' : 0.05G5L ,J.^steai^>Bacmusjstea^ NCC|0 1768 (T)' : 0.05959 ) 

: 0.0075 ) : 0.12658 , '<Eub.barker> Eubacterium barken ATCC 25849 (Ty : 0.287$ 1 ) : 0.0097 , 
C<C.quercico> Clostridium quercicolumAlXX.25974.(^ :.0..13519. ^. , <BeLchlpr2> Heliobacterium 
chlorum ATCC 35205 (Ty : 0.1075 ) : 0.01024 ) : 0.01183 , C<Fus.nuclea> Fusobacterium nucleatum 
subsp. nucleatum ATCC2558&eT)l:~a.^^ : 0.06051 , 

C<Cor.xerosi> Corynebacterium xerosis ATCC 373 (Ty : 0.10315 , C< B ^bifidu>Bir5dobacterium bifidum 






ATCC 29521 (Ty : 0.29842 , '<Arb.globit> Arthrobacter globifonnis str. 168 DSM 20124 (Ty : 0. 12957 ) : 
0.06797 ) : 0.0074a ) : 0.3137 ) : 0.01738 ) : 0.00511 , C<Cleptuixi> Clostridium tectum ATCC 29065 (T)' 
: 0.16126 , C<Cbutyric4> Clostridium butyricum str. E.VI3.6.1 NQDMB 8082* : 0.06037 , *<C.pasteuri> 
Clostridium pasteuriamim ATCC 6013 (T)' : 0*07626): 0.38023 ): 0.02432-).;. 0.01262 , 
(((((((((C<Rub.g0lat2> Rubrivivax gelatinosus str. ATH 2.2. 1 ATCC 17011 (T) 1 : 0.07169 , '<Spr.voluta> 
Spirillum vdlutans ATCC 19554 (T)' : 0.06661 ) : 0.00462 /<EU^.purpur> Rhodbc^yclus purpureus str. 
6770 DSM 168 (Ty : 0.04015 ) : 0.02165 , , <Nis.gonorl> Neisseria gonorrhoeae str. B 5025 NCTC 8375 
(T)' : 0.19789 ) : 0.01431 , i <Ste.maitop> Stenotrophomonas maitophilia ATCC ^3637 (T)' : 0.24098 ) : 
0.02299 , C < E coli> Escherichia coli [gene=rrnB operon]' : 0.05825 , '<Ps.aerugi3>Pseudomonas 
aemginosaDSM 50071 (Ty : a 63646): 0.03524): 0.04488 , '<AlmYino&m> Allofhromatiiun vinosum 
ATCC 17899 (T)' : 0.0233 ) : 0.04869 , '^khalcl^ Halorhodospira halochloris str. A ATCC 35916 (Ty 
: a05948> : 0.08019 , (C<Rrubirum3>Rliodo^)irnlum rubrum str. ATH 1.1.1; S.l ATCC 11170 (T)' : 
0.04904 , , <Azs.brasi2> Azospirillum brasilense str. Sp 7 NCIMB 11860 (Ty : 0.3086 ) : 0.01343 , 
(C<Ricprowaz> Rickettsia prowazekii str. Breinl ATCC W-l 42. (T) (alpha purpte^ bacterium) 1 : 0.1406 , 
'<Spgcapsul> Sphingomonas capsulata ATCC 14666 (Ty : 0.13872 ) : 0.02068 , C<Rhb legum8> 
Rhizobium leguminosarum IAM 12609 (T> r : 0.01576 , C<Bdr.iapQni> Bradyrhizob^um japonicum LMG 
6138 CD 1 : 0.05736 , '^Rm. vanniel> Rhodomicrobium vannielii str. EY33 ATCC 51194' : 0.093 ) : 0.04263 
) : 0.00617 ) : 0.03466 ) : 0.06772): O00546 , (C<Myx.xanthu>My?cococcus l xanthus str. DK1622' : 
0.11263 , *<Dsb.postga>Desulfobacter postgatei str. 2 ac 9DSM 2034 (Ty : 0.19098 ) : 0.01154 , 
C<Dsv.desulf> Desotfovil^^ :jDl.^)1563 , C<Bde.stolpi> 

Bdellovibrio stolpii str. XJKi2 ATCC 27052 (T)* : 0.05967 , C < Camjejun5> Campylobacter jejuni subsp. 
jejuni str. TGH 9011 ATCC 43431' : 0.Q1753 , C<Wln,succi2> WolineUasuccinogenes str. 602W (FDC) 
ATCC 29543 (T)' : 0.05551 , '<Hlb.pylor6> Helicobacter pyiori ATCC 43504 (T)' : 0.02351 ) : 0.18884 ) : 
1.11671 ): 018947 ) : 0.01602 ) : 0.15633 ) : 0.01513 ,.((C(C(C<^ip ft p^d>Trapon^nmpaiUdum str. 
Nichols* : 0. 14543 , ! <Spi.stenos> Spirochaeta stenostrepta str. Zl ATCC 25083 (Ty : 0.03623 ) : 0.03698 , 
'<Box±airgdo> Borrelia burgdorferi str. B31 ATCC 35210 (T)' : 6.3604 ) : 0.(J859 B '<Spi.haIoph> 
Spirochaeta halophila str. RSI ATCC 29478 (T)' : 0.02473 ) : 0.01206 , '<3rs.hyodys>Brachyspira 
hyodysenteriae str. B204 ATCC 31212* : 043546 ) : 0.04129 , C<I,pn.illini> Leptonema mini str. 3055' : 
0.07041 , '^Lps.inter^ Leptospira interrogans str. Kennewicki, serovar pomona 1 : 0.16902 ) : 0.05013 ) : 
0.01817 , C<#fosucS85>Fibrobacter succinogenes subsp. suecinogenes str. 1^85 ATCC 19169 (Ty : 
0.23142 , '<Acbtcapsl> Acidobacterium capsulatum str. 161' : 0.21099 ) : 0.03073 ) : 0.0094 , 
((((('<Syn.6301> Synechococcus sp. PCC 6301' : 0. 12285 , '<Nost.muscr> Nostoc rnuscorum PCC 7120* : 
0.06977 ) : 0.01225 , (*<Zea_mays_C> Zea mays (maize; corn; Indian corn) — chloroplast* : 0.145 , 
•<01sthtfG> Olisthodiscus luteus (stramenopile) - chloroplasf : 0.3525 ) : 0:09491 ) : 0.012 , 
'<Glbviolac> Gloeobacter violaceus PCC 7421' : 0.07279 ) : 0.01171 , C<env.MC18> Mount Coot-tha 
region (Brisbane, Australia) 5-10cm depth soil BNA clone MC 18' : 0.01409% C<Chdpsitta> 
Chlamydophilapsittaci str. 6BC ATCC VR-125 (T)' : 0.36004 , '<Pir.staley>Pirellula staleyi ATCC 27377* 






: 0.34247 ) : 0.25993 ): 0.1121 ) : 0.03258 , C<ChLlimico> Chlorobium limicola str. 832T : 0.1389 , 
C<Tmalapsuni> Thermonema lapsum ATCC 43542 (Ty : 0.0332 , C<Fix.litara> piexibacter litoralis str. 
Lewin SIO-4 ATCC 23117 (T) 1 : 0.01576 , C<Qy hutchin> Cytophaga hutchinsonii str. D465 (P.HA, 
Sneath) ATCC 33406 (T)' : O.0073 , ('<3>rb.difflu> Persicobacter diffluen* str. Lewhi L1M-1 ATCC 23140 1 
: 0.00585 , ('<Sap.grandi> Saprospira grandis ATCC 23119 (T) 1 : 0.02768 , C<Flx.canada> Flexibacter 
canadensis ATCC 29591 (Ty : 0,03254 , CC<BaafiragLl>Bacteroidesfragilis.ATCe 25285 (T)' : 0.04826 , 
•<Prv.nuncol>PrevoteUaruminicolasubsp. ruminicola ATCC 19189 (T) 1 : 0.20539 ) : 0.02821 , 
C<Qr.lytica> Cytophaga lytica str . LIM-21 ATCC 23178 (T)' : 0.14365 , , <Emb.j?revi2> Empedobacter 
brevis ATCC 14234' : 0.0913 ) : 0.35994 ) : 0.12199 ) : 0.33291 ) : 0.47588 ) : 0.14622 ) : 0.18424 ) : 
0.08878 > : 0.30465 ) : 0.05104 ) : 0.00825 ) : 0.02261 ) : 0.00329 > : 0.56238.) ; 0.52312 ) : 0.05444 ) : 
0.31178 ); 





Figure 11 The graphic view of the representative prokaryotic phylogenetic tree.






[Tree diagram of the trimmed local region. Branch nodes shown: 5543, 5545, 5547, 5549, 5556, 5557,
5565, 5573, 5575, 5576, 5577. Leaves, in the order drawn:]

Acidocella facilis [Acc.facil2] *
Acidiphilium angustum [Acdp.angu2] *
Acidiphilium acidophilum [Acdp.acph1] *
Acidiphilium multivorum [Acdp.mltvr] *
Acidiphilium organovorum [Acdp.organ] *
Gluconacetobacter diazotrophicus [Gab.diaztr] *
Gluconacetobacter xylinus [Gab.xylsuc] *
Acidomonas methanolica [Adm.metha2]
Acetobacter pasteurianus [Aba.paster]
Acetobacter aceti [Aba.aceti2] *
Gluconobacter cerinus [Gb.cerinus] *
Gluconobacter frateurii [Gb.frateur] *



Figure 12 



A local region of the representative tree following trimming from 3& to 12 sequences. The branch numbers
in the representative tree are labeled in the picture and can be correlated with the results given in Table F.
The complete representative tree is on the CD that is attached to this application.



Table A. 






Five best Qs scores for 15-mers that occur at least twice in the 16S rRNA data set. Files containing
complete tables of this type are given for various sized test sequences on the CD that is included with this
application. Sequences that never occur or are specific signatures of an individual organism are not
included in these lists. (Only a representative portion of the sequence listing is shown here.)

Sequence          NodeNum   QualityValue
AAAAAAACAGUCUCA   2815      0.5
AAAAAAACAGUCUCA   2831      0.5
AAAAAAACAGUCUCA   2836      0.44
AAAAAAACAGUCUCA   2839      0.4
AAAAAAACAGUCUCA   2865      0.33
AAAAAAAGACGGUAC   2064      1.0
AAAAAAAGACGGUAC   2072      0.67
AAAAAAAGACGGUAC   2107      0.29
AAAAAAAGACGGUAC   2108      0.10
AAAAAAAGACGGUAC   2137      0.0?
AAAAAAAUGACGGUA   3770      0.1
AAAAAAAUGACGGUA   3069      0.1
AAAAAAAUGACGGUA   2027      0.07
AAAAAAAUGACGGUA   2023      0.07
AAAAAAAUGACGGUA   1780      0.07
AAAAAACAGUCUCAG   2815      0.5
AAAAAACAGUCUCAG   2831      0.5
AAAAAACAGUCUCAG   2836      0.44
AAAAAACAGUCUCAG   2839      0.4
AAAAAACAGUCUCAG   2865      0.33
etc.






Table B 

Organism-specific sequences. Each of these sequences is uniquely found in the indicated organism. A file
containing a complete table of this type for sequences of length 12 can be found on the CD that is included
with this application. Similar lists of unique sequences can be generated for any length. (Only a
representative portion of the sequence listing is shown here.)



Sequence        Organism
AAAAAAACCApU    Mmycoide6
AAAAAAACGUGC    C.spAZ3_Bl
AAAAAAAGUUUC    Buc.aphUso
AAAAAAAUAAAA    Buc.aphUso
AAAAAAAU<5AAG   Nostmuscr
AAAAAAAUUAGG    M.floccul2
AAAAAAAUUUAU    Buc.aphCvi
AAAAAACACGUC    Eub.cellu2
AAAAAACCAACC    C.argenti3
AAAAAACCAAUC    Gsubterm2
AAAAAACCACUC    B.pallidus
AAAAAACCCUUC    Tms.chilns
AAAAAACCQJ3CC   Nsp.marinl
AAAAAACCGGUC    Trb.tumes2
AAAAAACGUGCC    C.spAZ3_Bl
AAAAAACUAAAG    Buc.aphCvi
AAAAAACUC5JGC   env.DA052
AAAAAACUGACG    env.OPB92
AAAAAAGAAGCA    Buc.aphCvi
AAAAAAGAGUGG    Mmlo.WX
AAAAAAGCCGAC    Pps.octavi
AAAAAAGCCGUC    Eub.rumina
AAAAAAGCCUUA    symCamnhe
AAAAAAGGGGGA    Buc.aphCvi
AAAAAAGUU^UC    Cow.ruminS
AAAAAAGUUUCG    Buc.aphUso
AAAAAAUAAAAC    Buc.aphUso
AAAAAAUACUCC    str.l6SX-l
AAAAAAUAGAGU    M.capxico6
AAAAAAUAUGUC    Cam.graci2
AAAAAAUCAAAA    Acp.oculi2
AAAAAAUCAAAU    M.conjunct
AAAAAAUCAAUC    M-mlaWX
AAAAAAUCCAUC    envAspo3






Table C 

The program subsystems and their functions and components.

Subsystem   Function                              Components
I           Sequence file format conversion       readseq, fasta2flat
II          Internal data structure preparation   seq_classifier, tree_parser, select_seq,
                                                  probe_hash_table_generator
III         Function value calculation            calc_node_value
IV          Result presentation                   result_printer, result_printer_, group_node_lister,
                                                  list_hit_branch_nodes, hybridize-






Table D 

The numbers of oligonucleotides of different lengths.

Oligomer    n    N_tp        N_t16S    N_m16S
Hexamer     6    4,096       4,096     4,096
Heptamer    7    16,384      16,340    16,324
Octamer     8    65,536      57,023    48,295
Nonamer     9    262,144     125,990   76,376
Decamer     10   1,048,576   186,781   86,856
Undecamer   11   4,194,304   228,995   91,652

N_tp - number of total possible oligonucleotides of length n.

N_t16S - number of total oligonucleotides from selected valid 16S rRNA sequences.

N_m16S - number of multi-occurring oligonucleotides from selected valid 16S rRNA sequences.
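
The count of all possible oligonucleotides of length n over a four-letter alphabet is 4^n, which can be checked with a short Python sketch. The helper names `total_possible` and `observed_fraction` are hypothetical, introduced only for this illustration.

```python
def total_possible(n):
    """Number of distinct oligonucleotides of length n over {A, C, G, U}."""
    return 4 ** n


def observed_fraction(n, observed):
    """Fraction of all possible length-n oligonucleotides that were observed."""
    return observed / total_possible(n)


# Total possible oligonucleotides for hexamers through undecamers.
possible = {n: total_possible(n) for n in range(6, 12)}
```

Dividing an observed count by `total_possible(n)` shows how quickly the data set stops covering sequence space as n grows.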






Table E 

The number of signature sequences that were found at various quality levels as a function of length.

Signature   Number of signatures at quality level Qs †   Phylogenetic groups
length      =1.0                     >0.6                coverage (%) ‡
5           35          482          674                 1.99
6           0           371          680                 4.29
7           4           372          urn                 24.35
8           457         1,722        6,168               65.39
9           2,533       5,580        15,34j)             79.48
10          5,016       9,212        21,919              82.39
11          6,788       11,607       25 y 86p            83.15
15          10,487      16,629       39,502              86.37

† Only signatures that can identify phylogenetic groups with three or more members are counted.
‡ The coverage is calculated by a computer program. Any branch nodes other than those that have two leaf
nodes as their two child nodes in the representative tree are regarded as phylogenetic groups (635 in total).
The signature quality Qs is greater than 0.6.
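
The definition of a phylogenetic group in the footnote (any branch node other than one whose two children are both leaves) can be expressed as a short Python sketch. The nested-tuple tree representation and the names `count_groups` and `coverage_percent` are assumptions for illustration, and reading coverage as the percentage of groups that have at least one qualifying signature is likewise an assumption.

```python
def count_groups(tree):
    """Count branch nodes that are not simply a pair of leaves.

    tree: a leaf name (str) or a (left, right) tuple
    """
    if isinstance(tree, str):
        return 0
    left, right = tree
    both_leaves = isinstance(left, str) and isinstance(right, str)
    own = 0 if both_leaves else 1
    return own + count_groups(left) + count_groups(right)


def coverage_percent(groups_with_signature, total_groups):
    """Percentage of phylogenetic groups that have at least one signature."""
    return 100.0 * groups_with_signature / total_groups


# Hypothetical five-leaf tree: the root and the (("C","D"),"E") node count as groups.
example_tree = (("A", "B"), (("C", "D"), "E"))
n_groups = count_groups(example_tree)
```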






Table F. The numbers of nonameric, undecameric, and 15-mer signature sequences at different branch tree
nodes (see Figure 12) in different ranges of signature quality level.

Number of nonameric (9), undecameric (11) and fifteenmeric (15) oligonucleotide sequences in each of three
signature quality ranges. The first value in each range is the count over all three lengths.

Branch node |  All    9   11   15 |  All    9   11   15 |  All    9   11   15
5543        |   77   12   32   33 |    0    0    0    0 |   36    5   12   19
5545        |  176   26   58   92 |    0    0    0    0 |  163   25   51   87
5547        |    0    0    0    0 |   22    4   to   13 |   98   10   36   52
5549        |   14    4    5    5 |   33    5   14   14 |  183   24   62   97
5556        |   47    6   13   28 |    0    0    0    0 |   29    2    8   19
5557        |   19    1    8   10 |   61    9   28   24 |  113   27   36   55
5565        |  298   42   99      |    0    0    0    0 |  108   24   46   38
5573        |  419   42  136  241 |    0    0    0    0 |  139   32   50   57
5575        |   90   12   30   48 |  165   24   48   93 |  102   23   39   40
5576        |   93   11   28   54 |  134   22   49   63 |  154   27   47   80
5577        |   17    5    6    6 |   61   15   21   25 |  109   26   41   42






Table G 



Parameter: Input Sample
Preferred: Body fluids (blood, urine, saliva, sputum, sperm, biopsy sample, feces); agricultural products
(grains, livestock, vegetables, etc.); soil; air particulates; PCR products; natural waters; contaminated
liquids; surface scrapings or swabbings; animal SiMA; cell cultures; virus-infected cultures; microbial
colonies
More Preferred: Body fluids, agricultural products, microbial colonies, PCR products
Most Preferred: Body fluids, PCR products

Parameter: Target organisms per sample
Preferred: 1-100
More Preferred: 2-20
Most Preferred: 1-2

Parameter: Target sequence type
Preferred: SSU rRNAs, LSU rRNAs, 5S rRNA, spacer region DNA from rRNA gene clusters, rRKA,
4.5S rRNA, 10S RNA, RNase P RNA, guide RNA, telomerase RNA, snRNAs (e.g. U1 RNA etc.), scRNAs,
mitochondrial SNA, virus DNA, virus RNA, PCR product, human DNA, human cDNA, artificial SNA
More Preferred: 16S rRNA, virus RNA, virus DNA, rRNA gene cluster spacer region DNA
Most Preferred: 16S rRNA

Parameter: Organism
Preferred: Bacterium, virus, plant, animal, fungus, yeast, mold, Archaea, Eukaryotes, spores, fish, human,
Gram-negative bacterium, Y. pestis, HIV1, B. anthracis, smallpox virus
More Preferred: Bacterium, Archaea, eukaryotic microorganisms, virus
Most Preferred: Bacterium

Parameter: Nucleic Acid
Preferred: Chromosomal DNA; rRNA; rDNA; cDNA^mt^&NA^cpg^afrNA; plasmid DNA;
oligonucleotides; PCR product; viral RNA; viral DNA; restriction fragment; YAC; BAC; cosmid
More Preferred: rRNA, viral RNA, viral DNA
Most Preferred: rRNA













Parameter: Sequence length
Preferred: 20-20,000
More Preferred: 100-12,000
Most Preferred: 500-2,500

Parameter: Probe length
Preferred: 5 to 2§8&
More Preferred: 7 to 2&
Most Preferred: 10 to 20

Parameter: Number of probes
Preferred: 2-100,000,000
More Preferred: 20Kt0fr,00fr
Most Preferred: 50-10,000

Parameter: Classification Level
Preferred: Kingdom; Phylum; Class; Order; Family; Genus; Species; Subgroups; Strain, Tribe, Serotype;
Gram stain
More Preferred: Genus; Species; Strain
Most Preferred: Genus, Species

Parameter: Utility
Preferred: Research; Adulterant Detection; Counterfeit Detection; Food Safety; Taxonomic Classification;
Environmental Monitoring; Agronomy; Law Enforcement
More Preferred: Biodefense; Adulterant Detection
Most Preferred: Clinical Diagnosis

Parameter: Sample preparation Agent
Preferred: acid, base, detergent, phenol/ethanol, isopropanol, chaotrope, enzyme, restriction endonuclease,
detergent
More Preferred: Polymerase, restriction endonuclease, phenol
Most Preferred: Polymerase, phenol

Parameter: Sample Preparation Pretreatment
Preferred: Filter, centrifuge, extract, adsorb, protease, nuclease, partition, wash, leach, lyse,
electrophoresis, precipitate, germinate, culture
More Preferred: Filter, centrifuge, culture
Most Preferred: Filter, culture

Parameter: Hybridization Media
Preferred: Aqueous buffer, solution containing formamide, zwitterion solution, heated solution, alcohol
solution
More Preferred: Aqueous buffer, solution containing formamide, heated solution
Most Preferred: Solution containing formamide, heated solution

Parameter: Cultivation Media
Preferred: LB, M9, blood agar, DMEM, calf ..., culture medium containing host cells
More Preferred: LB, blood agar, ... containing host cells
Most Preferred: Blood agar

Parameter: Separation media for sample preparation
Preferred: Ion exchanger, filter, ultrafilter, depth filter, multiwell filter, centrifuge tube, immobilized
metal affinity adsorbent, hydroxyapatite, silica, zirconia, magnetic beads
More Preferred: Ion exchanger, multiwell filter, ... affinity adsorbent, hydroxyapatite, silica, magnetic
Most Preferred: Ion exchanger, silica, magnetic beads

Parameter: Qs Minimum
Preferred: 0.5-1.0
More Preferred: >0.7
Most Preferred: >0.9



*>Q.7 


>0.S >QS 








<(L3 


«U5- 


O.08 


Parameter: Detection Means (Probe Hybridization)
Preferred: Mass Spec.; Fluorescence; Chemiluminescence; Enzyme Reaction; Radiochemical;
Self-quenching Probe hybridization; Surface Plasmon Resonance; Total Internal Reflection Fluorescence;
Liquid Crystals; Magnetic; Infrared; Array Detection; Peptide Nucleic Acid hybridization; Branched DNA
hybridization; Redox Chemistry; LNA hybridization

Parameter: Detection Means (Nonhybridization Methods)
Preferred: Mass Spectrometry; Electrophoresis; Affinity electrophoresis; Chromatography, HPLC; Neutron










